On Fri, 12 Aug 2022 23:18:15 +0200
Jiri Olsa <[email protected]> wrote:
> the patch below moves the bpf function into a separate object and switches
> off -mrecord-mcount for it.. so the function gets the profile call
> generated but it's not visible to ftrace
>
> this seems to work, but it depends on -mrecord-mcount support in gcc and
> it's x86 specific... other archs seem to use -fpatchable-function-entry,
> which does not seem to have an option to omit a symbol from being collected
> into the section
As I stated, the __mcount_loc section was created by ftrace. It has
nothing to do with -fpatchable-function-entry. It's just that the archs
that use that option do not have a gcc that creates the __mcount_loc section.
>
> disabling a specific ftrace symbol with the FTRACE_FL_DISABLED flag seems to
> be an easier and more generic solution.. I'll send an RFC for that
It's not easier.
Here, I have a POC for you and some more history.
The recordmcount.pl Perl script was the first to create the
__mcount_loc section in all objects that ftrace needed to hook to. It
did an objdump, found the locations of the calls to mcount, and created
another file with a __mcount_loc section that referenced all those
locations. It then compiled and relinked that back into the original object.
This was performed on all object files during the build, and had an
impact on build times. But this is when I also created and introduced
"make localmodconfig", which shrunk the build times for many people, so
nobody noticed the build time increase! :-)
Then John Reiser sent me a patch that created recordmcount.c that did
the same work but directly modified the ELF object files without having
to run scripts. This got rid of that horrible overhead from the scripts.
Then, the gcc folks decided to be helpful here as well and created the
-mrecord-mcount option that would create the __mcount_loc section for
us, so we no longer needed the recordmcount scripts/C program. But it
is not available across the board.
Today, objtool has also gotten involved, and added an "--mcount" option
that will create the section too.
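To illustrate what all these tools produce: each __mcount_loc entry is
just the address of one 'call __fentry__' site, and at boot ftrace walks
the collected range and turns every recorded call into a nop until
tracing is enabled. A rough sketch of that walk (not the actual ftrace
code; the helper name here is made up):

extern unsigned long __start_mcount_loc[];
extern unsigned long __stop_mcount_loc[];

static void walk_mcount_locations(void (*handle)(unsigned long ip))
{
	unsigned long *p;

	/* each entry is the address of one recorded fentry call site */
	for (p = __start_mcount_loc; p < __stop_mcount_loc; p++)
		handle(*p);
}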
Below is a patch that extends yours by adding a NO_MCOUNT_FILES list;
add an object file to it and that will prevent the other methods
from adding an mcount_loc location for it.
I'm adding the objtool folks to make sure this is fine with them.
Again, this is a proof of concept, but it works. It may need to be
cleaned up a bit before it is final.
-- Steve
Index: linux-trace.git/scripts/Makefile.build
===================================================================
--- linux-trace.git.orig/scripts/Makefile.build
+++ linux-trace.git/scripts/Makefile.build
@@ -206,8 +206,10 @@ sub_cmd_record_mcount = perl $(srctree)/
"$(if $(part-of-module),1,0)" "$(@)";
recordmcount_source := $(srctree)/scripts/recordmcount.pl
endif # BUILD_C_RECORDMCOUNT
-cmd_record_mcount = $(if $(findstring $(strip $(CC_FLAGS_FTRACE)),$(_c_flags)), \
+chk_sub_cmd_record_mcount = $(if $(filter $(shell basename $@),$(NO_MCOUNT_FILES)),, \
$(sub_cmd_record_mcount))
+cmd_record_mcount = $(if $(findstring $(strip $(CC_FLAGS_FTRACE)),$(_c_flags)), \
+ $(chk_sub_cmd_record_mcount))
endif # CONFIG_FTRACE_MCOUNT_USE_RECORDMCOUNT
# 'OBJECT_FILES_NON_STANDARD := y': skip objtool checking for a directory
Index: linux-trace.git/scripts/Makefile.lib
===================================================================
--- linux-trace.git.orig/scripts/Makefile.lib
+++ linux-trace.git/scripts/Makefile.lib
@@ -233,7 +233,8 @@ objtool_args = \
$(if $(CONFIG_HAVE_JUMP_LABEL_HACK), --hacks=jump_label) \
$(if $(CONFIG_HAVE_NOINSTR_HACK), --hacks=noinstr) \
$(if $(CONFIG_X86_KERNEL_IBT), --ibt) \
- $(if $(CONFIG_FTRACE_MCOUNT_USE_OBJTOOL), --mcount) \
+ $(if $(filter $(shell basename $@),$(NO_MCOUNT_FILES)),, \
+ $(if $(CONFIG_FTRACE_MCOUNT_USE_OBJTOOL), --mcount)) \
$(if $(CONFIG_UNWINDER_ORC), --orc) \
$(if $(CONFIG_RETPOLINE), --retpoline) \
$(if $(CONFIG_SLS), --sls) \
Index: linux-trace.git/net/core/Makefile
===================================================================
--- linux-trace.git.orig/net/core/Makefile
+++ linux-trace.git/net/core/Makefile
@@ -11,10 +11,15 @@ obj-$(CONFIG_SYSCTL) += sysctl_net_core.
obj-y += dev.o dev_addr_lists.o dst.o netevent.o \
neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
- fib_notifier.o xdp.o flow_offload.o gro.o
+ fib_notifier.o xdp.o flow_offload.o gro.o \
+ dispatcher.o
obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o
+# remove dispatcher function from ftrace sight
+CFLAGS_REMOVE_dispatcher.o = -mrecord-mcount
+NO_MCOUNT_FILES += dispatcher.o
+
obj-y += net-sysfs.o
obj-$(CONFIG_PAGE_POOL) += page_pool.o
obj-$(CONFIG_PROC_FS) += net-procfs.o
Index: linux-trace.git/net/core/dispatcher.c
===================================================================
--- /dev/null
+++ linux-trace.git/net/core/dispatcher.c
@@ -0,0 +1,3 @@
+#include <linux/bpf.h>
+
+DEFINE_BPF_DISPATCHER(xdp)
Index: linux-trace.git/net/core/filter.c
===================================================================
--- linux-trace.git.orig/net/core/filter.c
+++ linux-trace.git/net/core/filter.c
@@ -11162,8 +11162,6 @@ const struct bpf_verifier_ops sk_lookup_
#endif /* CONFIG_INET */
-DEFINE_BPF_DISPATCHER(xdp)
-
void bpf_prog_change_xdp(struct bpf_prog *prev_prog, struct bpf_prog *prog)
{
bpf_dispatcher_change_prog(BPF_DISPATCHER_PTR(xdp), prev_prog, prog);
On Sat, 13 Aug 2022 15:02:52 -0400
Steven Rostedt <[email protected]> wrote:
> Index: linux-trace.git/scripts/Makefile.lib
> ===================================================================
> --- linux-trace.git.orig/scripts/Makefile.lib
> +++ linux-trace.git/scripts/Makefile.lib
> @@ -233,7 +233,8 @@ objtool_args = \
> $(if $(CONFIG_HAVE_JUMP_LABEL_HACK), --hacks=jump_label) \
> $(if $(CONFIG_HAVE_NOINSTR_HACK), --hacks=noinstr) \
> $(if $(CONFIG_X86_KERNEL_IBT), --ibt) \
> - $(if $(CONFIG_FTRACE_MCOUNT_USE_OBJTOOL), --mcount) \
> + $(if $(filter $(shell basename $@),$(NO_MCOUNT_FILES)),, \
> + $(if $(CONFIG_FTRACE_MCOUNT_USE_OBJTOOL), --mcount)) \
I believe there are some security and other validations that objtool does
that require it to know about the mcount locations.
If BPF is doing something unique, and modifying code as well (outside
the jump label and ftrace work), does objtool need to know about that too?
-- Steve
On Sat, Aug 13, 2022 at 03:02:52PM -0400, Steven Rostedt wrote:
> Today, objtool has also gotten involved, and added an "--mcount" option
> that will create the section too.
I overlooked that objtool is involved as well,
I'll check on that
>
> Below is a patch that extends yours by adding a NO_MCOUNT_FILES list;
> add an object file to it and that will prevent the other methods
> from adding an mcount_loc location for it.
thanks,
jirka
On Sat, Aug 13, 2022 at 03:02:52PM -0400, Steven Rostedt wrote:
> On Fri, 12 Aug 2022 23:18:15 +0200
> Jiri Olsa <[email protected]> wrote:
>
> > > the patch below moves the bpf function into a separate object and switches
> > > off -mrecord-mcount for it.. so the function gets the profile call
> > > generated but it's not visible to ftrace
Why ?!?
On Mon, Aug 15, 2022 at 10:03:17AM +0200, Peter Zijlstra wrote:
> Why ?!?
there's bpf dispatcher code that updates the bpf_dispatcher_xdp_func
function with bpf_arch_text_poke, and that can race with an ftrace update
if the function is traced
the idea to solve it is to 'mark' the function as independent of ftrace,
and add a way to make the function invisible to ftrace while still having
the fentry profiling call generated
jirka
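To illustrate the race being described — both writers target the same
5-byte patch site at the function's entry (a sketch, not actual kernel
code):

/*
 *   CPU A (bpf dispatcher)            CPU B (ftrace)
 *   ----------------------            --------------
 *   bpf_arch_text_poke(ip, ...)       ftrace update at ip
 *     expects the old insn at ip        expects a nop or call at ip
 *     writes jmp to dispatcher image    writes call to ftrace trampoline
 *
 * If bpf_dispatcher_xdp_func() is visible to ftrace, both sides assume
 * they own the bytes at ip, and one update can clobber or trip over
 * the other.
 */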
On Mon, Aug 15, 2022 at 11:44:32AM +0200, Jiri Olsa wrote:
> there's bpf dispatcher code that updates the bpf_dispatcher_xdp_func
> function with bpf_arch_text_poke, and that can race with an ftrace update
> if the function is traced
I thought bpf_arch_text_poke() wasn't allowed to touch kernel code and
ftrace is in full control of it ?
On Mon, Aug 15, 2022 at 5:37 AM Peter Zijlstra <[email protected]> wrote:
>
> I thought bpf_arch_text_poke() wasn't allowed to touch kernel code and
> ftrace is in full control of it ?
ftrace is not in "full control" of nop5 and must not be.
Soon we will have nop5 in the middle of the function.
ftrace must not touch it.
On Mon, Aug 15, 2022 at 07:25:24AM -0700, Alexei Starovoitov wrote:
> ftrace is not in "full control" of nop5 and must not be.
It is in full control of the 'call __fentry__'. Absolute full NAK on you
trying to make it otherwise.
> Soon we will have nop5 in the middle of the function.
> ftrace must not touch it.
How are you generating that NOP and what for?
On Mon, Aug 15, 2022 at 7:33 AM Peter Zijlstra <[email protected]> wrote:
> It is in full control of the 'call __fentry__'. Absolute full NAK on you
> trying to make it otherwise.
Don't mix the 'call fentry' generated by the compiler with a nop5 inserted
by macros or JITs.
> > Soon we will have nop5 in the middle of the function.
> > ftrace must not touch it.
>
> How are you generating that NOP and what for?
We're generating nop5s in JITed code to later attach to,
for example when one bpf prog is being replaced by another.
Currently it's in the func prologue only.
In the future it will be anywhere in the body.
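For illustration, a minimal sketch of emitting such an attach point
(assumed detail, not the actual BPF JIT; the helper name is made up,
and the 5 bytes are the canonical x86 5-byte nop, nopl 0x0(%rax,%rax,1)):

#include <string.h>

#define X86_PATCH_SIZE 5

static const unsigned char x86_nop5[X86_PATCH_SIZE] = {
	0x0f, 0x1f, 0x44, 0x00, 0x00
};

/* reserve a 5-byte slot in the emitted image that can later be
 * live-patched into a call/jmp when another program attaches */
static unsigned char *emit_attach_point(unsigned char *image)
{
	memcpy(image, x86_nop5, X86_PATCH_SIZE);
	return image + X86_PATCH_SIZE;
}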
On Mon, Aug 15, 2022 at 07:45:16AM -0700, Alexei Starovoitov wrote:
> On Mon, Aug 15, 2022 at 7:33 AM Peter Zijlstra <[email protected]> wrote:
> > It is in full control of the 'call __fentry__'. Absolute full NAK on you
> > trying to make it otherwise.
>
> Don't mix the 'call fentry' generated by the compiler with a nop5 inserted
> by macros or JITs.
Looking at:
https://lore.kernel.org/all/[email protected]/
this seems to want to prod at the __fentry__ sites.
> > > Soon we will have nop5 in the middle of the function.
> > > ftrace must not touch it.
> >
> > How are you generating that NOP and what for?
>
> We're generating nop5s in JITed code to later
> attach to.
Ftrace doesn't know about those; so how can it break that?
Likewise it doesn't know about the static_branch/static_call NOPs and
nothing is broken.
Ftrace only knows about the __fentry__ sites, and it *does* own them.
Are you saying ftrace is writing to a code location that is not a __fentry__
site?
On Mon, Aug 15, 2022 at 8:02 AM Peter Zijlstra <[email protected]> wrote:
Let's keep it in one thread:
> It wasn't long before. Yes it landed a few months prior to the
> static_call work, but the whole static_call thing was in progress for a
> long long time.
>
> Anyway, yes it is different. But it's still very much broken. You simply
> cannot step on __fentry__ sites like that.
Ask yourself: should static_call patching logic go through
ftrace infra ? No. Right?
static_call has nothing to do with ftrace (function tracing).
Same thing here. bpf dispatching logic has nothing to do with
function tracing.
In this case bpf_dispatcher_xdp_func is a placeholder written in C.
If it was written in asm, fentry recording wouldn't have known about it.
And that's more or less what Jiri's patch is doing.
It's hiding a fake function from ftrace, since it's not a function
and ftrace infra shouldn't show it in tracing logs.
In other words it's a _notrace_ function with nop5.
On Mon, 15 Aug 2022 08:17:42 -0700
Alexei Starovoitov <[email protected]> wrote:
> Ask yourself: should static_call patching logic go through
> ftrace infra ? No. Right?
I agree that static_call (and jump_labels) are not part of the ftrace
infrastructure (but ftrace was a strong motivator for those).
> static_call has nothing to do with ftrace (function tracing).
Besides the motivation, I agree.
> Same thing here. bpf dispatching logic is nothing to do with
> function tracing.
But it used fentry, which is part of function tracing. Which is what I'm
against. And why it broke ftrace.
> In this case bpf_dispatcher_xdp_func is a placeholder written in C.
> If it was written in asm, fentry recording wouldn't have known about it.
And I would not have had an issue with that approach (for ftrace that is).
But that brings up other concerns (see below).
> And that's more or less what Jiri's patch is doing.
> It's hiding a fake function from ftrace, since it's not a function
> and ftrace infra shouldn't show it in tracing logs.
> In other words it's a _notrace_ function with nop5.
On the ftrace side, I'm perfectly happy with Jiri's approach (the one I
helped extend).
But dynamic code modification is something we need to take very seriously.
It's very similar to writing your own locking primitives (which Linus
always says "Don't do"). It's complex and easy to get wrong. The more
dynamic code modifications we have, the less secure the kernel is.
Here's the list of dynamic code modification infrastructures:
ftrace
kprobes
jump_labels
static_calls
We now have the bpf dispatcher.
The ftrace, kprobes, jump_labels and static_calls developers work together
to make sure that we are all in line, not breaking anything, and try to
consolidate when possible. We also review each others code.
The issue I have is that BPF is largely doing it alone, and not
communicating with the others. This gives me cause for concern from both a
robustness and a security point of view.
-- Steve
On Mon, Aug 15, 2022 at 08:17:42AM -0700, Alexei Starovoitov wrote:
> It's hiding a fake function from ftrace, since it's not a function
> and ftrace infra shouldn't show it in tracing logs.
> In other words it's a _notrace_ function with nop5.
Then make it a notrace function with a nop5 in it. That isn't hard.
The whole problem is that it isn't a notrace function and you're abusing
a __fentry__ site.
On Mon, Aug 15, 2022 at 05:28:02PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 15, 2022 at 08:17:42AM -0700, Alexei Starovoitov wrote:
> > It's hiding a fake function from ftrace, since it's not a function
> > and ftrace infra shouldn't show it in tracing logs.
> > In other words it's a _notrace_ function with nop5.
>
> Then make it a notrace function with a nop5 in it. That isn't hard.
>
> The whole problem is that it isn't a notrace function and you're abusing
> a __fentry__ site.
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=x86/fineibt&id=8d075bdf11193f1d276bf19fa56b4b8dfe24df9e
foo.c:
__attribute__((__no_instrument_function__))
__attribute__((patchable_function_entry(5)))
void my_func(void)
{
}
void my_foo(void)
{
}
gcc -c foo.c -pg -mfentry -mcmodel=kernel -fno-PIE -O2
foo.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <my_func>:
0: f3 0f 1e fa endbr64
4: 90 nop
5: 90 nop
6: 90 nop
7: 90 nop
8: 90 nop
9: c3 ret
a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000000010 <my_foo>:
10: f3 0f 1e fa endbr64
14: e8 00 00 00 00 call 19 <my_foo+0x9> 15: R_X86_64_PLT32 __fentry__-0x4
19: c3 ret
On Mon, Aug 15, 2022 at 8:41 AM Peter Zijlstra <[email protected]> wrote:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=x86/fineibt&id=8d075bdf11193f1d276bf19fa56b4b8dfe24df9e
Brand new stuff.
Awesome. That should fit perfectly.
> foo.c:
>
> __attribute__((__no_instrument_function__))
> __attribute__((patchable_function_entry(5)))
Interesting. Didn't know about this attribute.
> void my_func(void)
> {
> }
>
> void my_foo(void)
> {
> }
Great.
Jiri, could you please revise your patch with this approach?
On Mon, 15 Aug 2022 08:49:11 -0700
Alexei Starovoitov <[email protected]> wrote:
> > > The whole problem is that it isn't a notrace function and you're abusing
> > > a __fentry__ site.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=x86/fineibt&id=8d075bdf11193f1d276bf19fa56b4b8dfe24df9e
>
> Brand new stuff.
> Awesome. That should fit perfectly.
>
> > foo.c:
> >
> > __attribute__((__no_instrument_function__))
> > __attribute__((patchable_function_entry(5)))
>
> Interesting. Didn't know about this attribute.
>
> > void my_func(void)
> > {
> > }
> >
> > void my_foo(void)
> > {
> > }
>
> Great.
> Jiri, could you please revise your patch with this approach?
This is the exact result I was looking for. Something we can all agree to.
The point being, include others when developing code that is similar to
what other subsystems do. On the code modification front, please Cc the
ftrace, kprobe, static_call and jump_label maintainers, as we like to work
together. The BPF dispatcher modifications should be no different. There's
a lot of experience in this field throughout the kernel. Please utilize it.
If it wasn't for this thread, we would never have found out about this easy
solution.
-- Steve
On Mon, Aug 15, 2022 at 05:41:27PM +0200, Peter Zijlstra wrote:
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=x86/fineibt&id=8d075bdf11193f1d276bf19fa56b4b8dfe24df9e
>
> foo.c:
>
> __attribute__((__no_instrument_function__))
> __attribute__((patchable_function_entry(5)))
> void my_func(void)
> {
> }
>
> void my_foo(void)
> {
> }
>
> gcc -c foo.c -pg -mfentry -mcmodel=kernel -fno-PIE -O2
>
> foo.o: file format elf64-x86-64
>
>
> Disassembly of section .text:
>
> 0000000000000000 <my_func>:
> 0: f3 0f 1e fa endbr64
> 4: 90 nop
> 5: 90 nop
> 6: 90 nop
> 7: 90 nop
> 8: 90 nop
> 9: c3 ret
> a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
>
> 0000000000000010 <my_foo>:
> 10: f3 0f 1e fa endbr64
> 14: e8 00 00 00 00 call 19 <my_foo+0x9> 15: R_X86_64_PLT32 __fentry__-0x4
> 19: c3 ret
>
ok, so the problem with __attribute__((patchable_function_entry(5))) is that
it puts the function address into the __patchable_function_entries section,
which is one of the sources ftrace collects locations from:
#define MCOUNT_REC() . = ALIGN(8); \
__start_mcount_loc = .; \
KEEP(*(__mcount_loc)) \
KEEP(*(__patchable_function_entries)) \
__stop_mcount_loc = .; \
...
it looks like __patchable_function_entries is used by archs other than x86,
so perhaps we could have an x86-specific MCOUNT_REC macro with just the
__mcount_loc section?
jirka
On Thu, 18 Aug 2022 22:27:07 +0200
Jiri Olsa <[email protected]> wrote:
> ok, so the problem with __attribute__((patchable_function_entry(5))) is that
> it puts the function address into the __patchable_function_entries section,
> which is one of the sources ftrace collects locations from:
>
> #define MCOUNT_REC() . = ALIGN(8); \
> __start_mcount_loc = .; \
> KEEP(*(__mcount_loc)) \
> KEEP(*(__patchable_function_entries)) \
> __stop_mcount_loc = .; \
> ...
>
>
> it looks like __patchable_function_entries is used by archs other than x86,
> so perhaps we could have an x86-specific MCOUNT_REC macro with just the
> __mcount_loc section?
So something like this:
#ifdef CONFIG_X86
# define NON_MCOUNT_PATCHABLE KEEP(*(__patchable_function_entries))
# define MCOUNT_PATCHABLE
#else
# define NON_MCOUNT_PATCHABLE
# define MCOUNT_PATCHABLE KEEP(*(__patchable_function_entries))
#endif
#define MCOUNT_REC() . = ALIGN(8); \
__start_mcount_loc = .; \
KEEP(*(__mcount_loc)) \
MCOUNT_PATCHABLE \
__stop_mcount_loc = .; \
NON_MCOUNT_PATCHABLE \
...
??
-- Steve
On Thu, Aug 18, 2022 at 1:50 PM Steven Rostedt <[email protected]> wrote:
>
> So something like this:
>
> #ifdef CONFIG_X86
> # define NON_MCOUNT_PATCHABLE KEEP(*(__patchable_function_entries))
> # define MCOUNT_PATCHABLE
> #else
> # define NON_MCOUNT_PATCHABLE
> # define MCOUNT_PATCHABLE KEEP(*(__patchable_function_entries))
> #endif
>
> #define MCOUNT_REC() . = ALIGN(8); \
> __start_mcount_loc = .; \
> KEEP(*(__mcount_loc)) \
> MCOUNT_PATCHABLE \
> __stop_mcount_loc = .; \
> NON_MCOUNT_PATCHABLE \
> ...
>
> ??
That's more or less what Peter's patch is doing.
Here it is again for reference:
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?id=8d075bdf11193f1d276bf19fa56b4b8dfe24df9e
On Thu, 18 Aug 2022 14:00:21 -0700
Alexei Starovoitov <[email protected]> wrote:
> That's more or less what Peter's patch is doing.
Heh.
> Here it is again for reference:
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?id=8d075bdf11193f1d276bf19fa56b4b8dfe24df9e
Thanks, I missed seeing this.
-- Steve
On Thu, Aug 18, 2022 at 04:50:24PM -0400, Steven Rostedt wrote:
> So something like this:
>
> #ifdef CONFIG_X86
> # define NON_MCOUNT_PATCHABLE KEEP(*(__patchable_function_entries))
> # define MCOUNT_PATCHABLE
> #else
> # define NON_MCOUNT_PATCHABLE
> # define MCOUNT_PATCHABLE KEEP(*(__patchable_function_entries))
> #endif
>
> #define MCOUNT_REC() . = ALIGN(8); \
> __start_mcount_loc = .; \
> KEEP(*(__mcount_loc)) \
> MCOUNT_PATCHABLE \
> __stop_mcount_loc = .; \
> NON_MCOUNT_PATCHABLE \
> ...
>
is there a reason to keep the NON_MCOUNT_PATCHABLE section for x86? otherwise LGTM
jirka
On Thu, Aug 18, 2022 at 02:00:21PM -0700, Alexei Starovoitov wrote:
>
> That's more or less what Peter's patch is doing.
> Here it is again for reference:
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?id=8d075bdf11193f1d276bf19fa56b4b8dfe24df9e
ah nice, and discards the __patchable_function_entries section, great
jirka
On Thu, Aug 18, 2022 at 11:32:55PM +0200, Jiri Olsa wrote:
>
> ah nice, and discards the __patchable_function_entries section, great
>
tested the change below with Peter's change above and it seems to work;
once it gets merged I'll send the full patch
thanks,
jirka
---
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 39bd36359c1e..39b6807058e9 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -924,6 +924,8 @@ int arch_prepare_bpf_dispatcher(void *image, s64 *funcs, int num_funcs);
}
#define DEFINE_BPF_DISPATCHER(name) \
+ __attribute__((__no_instrument_function__)) \
+ __attribute__((patchable_function_entry(5))) \
noinline __nocfi unsigned int bpf_dispatcher_##name##_func( \
const void *ctx, \
const struct bpf_insn *insnsi, \
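For reference, a rough sketch of what the revised macro would then
expand to for the xdp dispatcher (abbreviated and partly assumed — the
hunk above only shows the start of the macro; the body returning
bpf_func(ctx, insnsi) is the assumed placeholder behavior):

__attribute__((__no_instrument_function__))
__attribute__((patchable_function_entry(5)))
noinline unsigned int bpf_dispatcher_xdp_func(
	const void *ctx,
	const struct bpf_insn *insnsi,
	unsigned int (*bpf_func)(const void *,
				 const struct bpf_insn *))
{
	/* 5-byte patchable entry with no compiler-recorded fentry site:
	 * bpf_arch_text_poke() can retarget it without ftrace knowing
	 * about the function
	 */
	return bpf_func(ctx, insnsi);
}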
On Fri, Aug 19, 2022 at 4:45 AM Jiri Olsa <[email protected]> wrote:
>
> tested the change below with Peter's change above and it seems to work;
> once it gets merged I'll send the full patch
Peter,
what is the ETA to land your changes?
That particular commit is detached in your git tree.
Did you move it to a different branch?
Just trying to figure out the logistics to land Jiri's fix below.
We can take it into bpf-next, since it's harmless as-is,
but it won't have an effect until your change lands.
Sounds like they both will get in during the next merge window?
On Tue, Aug 23, 2022 at 10:23:24AM -0700, Alexei Starovoitov wrote:
> On Fri, Aug 19, 2022 at 4:45 AM Jiri Olsa <[email protected]> wrote:
> >
> > tested the change below with Peter's change above and it seems to work;
> > once it gets merged I'll send the full patch
>
> Peter,
> what is the ETA to land your changes?
> That particular commit is detached in your git tree.
> Did you move it to a different branch?
>
> Just trying to figure out the logistics to land Jiri's fix below.
> We can take it into bpf-next, since it's harmless as-is,
> but it won't have an effect until your change lands.
> Sounds like they both will get in during the next merge window?
I discussed it with Peter and I'll send his change together with my fix
jirka