2022-10-14 16:21:33

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [RFC v2 3/4] modules, x86: use vmalloc_exec for module core

On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index a4e4d84b6f4e..b44806e31a56 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -53,6 +53,7 @@
> #include <linux/bsearch.h>
> #include <linux/dynamic_debug.h>
> #include <linux/audit.h>
> +#include <linux/bpf.h>
> #include <uapi/linux/module.h>
> #include "internal.h"
>
> @@ -1203,7 +1204,7 @@ static void free_module(struct module *mod)
> lockdep_free_key_range(mod->data_layout.base, mod-
> >data_layout.size);
>
> /* Finally, free the core (containing the module structure)
> */
> - module_memfree(mod->core_layout.base);
> + vfree_exec(mod->core_layout.base);
> #ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> vfree(mod->data_layout.base);
> #endif
> @@ -1321,7 +1322,8 @@ static int simplify_symbols(struct module *mod,
> const struct load_info *info)
> ksym = resolve_symbol_wait(mod, info, name);
> /* Ok if resolved. */
> if (ksym && !IS_ERR(ksym)) {
> - sym[i].st_value =
> kernel_symbol_value(ksym);
> + unsigned long val =
> kernel_symbol_value(ksym);
> + bpf_arch_text_copy(&sym[i].st_value,
> &val, sizeof(val));

Why bpf_arch_text_copy()? This of course won't work for other
architectures. So there needs to be fallback method. That RFC broke the
operation into two stages: Loading and finalized. When loading, on non-
x86 the writes would simply be to the allocation mapped as writable.
When it was finalized it changed it to it's final permission (RO, etc).
Then for x86 it does text_pokes() for the writes and has it RO from the
beginning.

I ended up needing a staging buffer for modules too, so that the code
could operate on it directly. I can't remember why that was, it might
be unneeded now since you moved data out of the core allocation.



2022-10-14 19:14:19

by Song Liu

[permalink] [raw]
Subject: Re: [RFC v2 3/4] modules, x86: use vmalloc_exec for module core



> On Oct 14, 2022, at 8:42 AM, Edgecombe, Rick P <[email protected]> wrote:
>
> On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
>> diff --git a/kernel/module/main.c b/kernel/module/main.c
>> index a4e4d84b6f4e..b44806e31a56 100644
>> --- a/kernel/module/main.c
>> +++ b/kernel/module/main.c
>> @@ -53,6 +53,7 @@
>> #include <linux/bsearch.h>
>> #include <linux/dynamic_debug.h>
>> #include <linux/audit.h>
>> +#include <linux/bpf.h>
>> #include <uapi/linux/module.h>
>> #include "internal.h"
>>
>> @@ -1203,7 +1204,7 @@ static void free_module(struct module *mod)
>> lockdep_free_key_range(mod->data_layout.base, mod-
>>> data_layout.size);
>>
>> /* Finally, free the core (containing the module structure)
>> */
>> - module_memfree(mod->core_layout.base);
>> + vfree_exec(mod->core_layout.base);
>> #ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
>> vfree(mod->data_layout.base);
>> #endif
>> @@ -1321,7 +1322,8 @@ static int simplify_symbols(struct module *mod,
>> const struct load_info *info)
>> ksym = resolve_symbol_wait(mod, info, name);
>> /* Ok if resolved. */
>> if (ksym && !IS_ERR(ksym)) {
>> - sym[i].st_value =
>> kernel_symbol_value(ksym);
>> + unsigned long val =
>> kernel_symbol_value(ksym);
>> + bpf_arch_text_copy(&sym[i].st_value,
>> &val, sizeof(val));
>
> Why bpf_arch_text_copy()? This of course won't work for other
> architectures. So there needs to be fallback method. That RFC broke the
> operation into two stages: Loading and finalized. When loading, on non-
> x86 the writes would simply be to the allocation mapped as writable.
> When it was finalized it changed it to it's final permission (RO, etc).
> Then for x86 it does text_pokes() for the writes and has it RO from the
> beginning.

Yeah, this one (3/4) is really a prototype to show vmalloc_exec could
work for modules (with a lot more work of course). And something to
replace bpf_arch_text_copy() is one of the issues we need to address in
the future.

>
> I ended up needing a staging buffer for modules too, so that the code
> could operate on it directly. I can't remember why that was, it might
> be unneeded now since you moved data out of the core allocation.

Both bpf_jit and bpf_dispather uses a staging buffer with bpf_prog_pack.
The benefit of this approach is that it minimizes the number of
text_poke/copy() calls. OTOH, it is quite a pain to make all the
relative calls correct, as the staging buffer has different address to
the final allocation.

I think we may not need the staging buffer for modules, as module
load/unload happens less often than BPF program JITs (so it is ok for
it to be slightly slower).

btw: I cannot take credit for split module data out of core allocation,
Christophe Leroy did the work. :)

Thanks,
Song

2022-10-14 19:14:58

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [RFC v2 3/4] modules, x86: use vmalloc_exec for module core

On Fri, 2022-10-14 at 18:26 +0000, Song Liu wrote:
> > On Oct 14, 2022, at 8:42 AM, Edgecombe, Rick P <
> > [email protected]> wrote:
> >
> > On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
> > > diff --git a/kernel/module/main.c b/kernel/module/main.c
> > > index a4e4d84b6f4e..b44806e31a56 100644
> > > --- a/kernel/module/main.c
> > > +++ b/kernel/module/main.c
> > > @@ -53,6 +53,7 @@
> > > #include <linux/bsearch.h>
> > > #include <linux/dynamic_debug.h>
> > > #include <linux/audit.h>
> > > +#include <linux/bpf.h>
> > > #include <uapi/linux/module.h>
> > > #include "internal.h"
> > >
> > > @@ -1203,7 +1204,7 @@ static void free_module(struct module *mod)
> > > lockdep_free_key_range(mod->data_layout.base, mod-
> > > > data_layout.size);
> > >
> > > /* Finally, free the core (containing the module
> > > structure)
> > > */
> > > - module_memfree(mod->core_layout.base);
> > > + vfree_exec(mod->core_layout.base);
> > > #ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> > > vfree(mod->data_layout.base);
> > > #endif
> > > @@ -1321,7 +1322,8 @@ static int simplify_symbols(struct module
> > > *mod,
> > > const struct load_info *info)
> > > ksym = resolve_symbol_wait(mod, info,
> > > name);
> > > /* Ok if resolved. */
> > > if (ksym && !IS_ERR(ksym)) {
> > > - sym[i].st_value =
> > > kernel_symbol_value(ksym);
> > > + unsigned long val =
> > > kernel_symbol_value(ksym);
> > > +
> > > bpf_arch_text_copy(&sym[i].st_value,
> > > &val, sizeof(val));
> >
> > Why bpf_arch_text_copy()? This of course won't work for other
> > architectures. So there needs to be fallback method. That RFC broke
> > the
> > operation into two stages: Loading and finalized. When loading, on
> > non-
> > x86 the writes would simply be to the allocation mapped as
> > writable.
> > When it was finalized it changed it to it's final permission (RO,
> > etc).
> > Then for x86 it does text_pokes() for the writes and has it RO from
> > the
> > beginning.
>
> Yeah, this one (3/4) is really a prototype to show vmalloc_exec
> could
> work for modules (with a lot more work of course). And something to
> replace bpf_arch_text_copy() is one of the issues we need to address
> in
> the future.

Right, I think making it work both with and without text_poke() is
needed to ever get it to work with modules. Since so much of it is
cross arch. Oops, it looks like we lost Luis on most of these
responses.

If we don't have that, then the modules RFC is kind of a distraction.
But bpf, kprobes and ftrace is still nice.