2018-03-06 01:36:43

by Alexei Starovoitov

[permalink] [raw]
Subject: [PATCH net-next] modules: allow modprobe load regular elf binaries

As the first step in development of bpfilter project [1] the request_module()
code is extended to allow user mode helpers to be invoked. Idea is that
user mode helpers are built as part of the kernel build and installed as
traditional kernel modules with .ko file extension into distro specified
location, such that from a distribution point of view, they are no different
than regular kernel modules. Thus, allow request_module() logic to load such
user mode helper (umh) modules via:

request_module("foo") ->
call_umh("modprobe foo") ->
sys_finit_module(FD of /lib/modules/.../foo.ko) ->
call_umh(struct file)

Such approach enables kernel to delegate functionality traditionally done
by kernel modules into user space processes (either root or !root) and
reduces security attack surface of such new code, meaning in case of
potential bugs only the umh would crash but not the kernel. Another
advantage coming with that would be that bpfilter.ko can be debugged and
tested out of user space as well (e.g. opening the possibility to run
all clang sanitizers, fuzzers or test suites for checking translation).
Also, such architecture makes the kernel/user boundary very precise:
control plane is done by the user space while data plane stays in the kernel.

It's easy to distinguish "umh module" from traditional kernel module:

$ readelf -h bld/net/bpfilter/bpfilter.ko|grep Type
Type: EXEC (Executable file)
$ readelf -h bld/net/ipv4/netfilter/iptable_filter.ko |grep Type
Type: REL (Relocatable file)

Since umh can crash, can be oom-ed by the kernel, killed by admin,
the subsystem that uses them (like bpfilter) need to manage life
time of umh on its own, so module infra doesn't do any accounting
of them. They don't appear in "lsmod" and cannot be "rmmod".
Multiple request_module("umh") will load multiple umh.ko processes.

Similar to kernel modules the kernel will be tainted if "umh module"
has invalid signature.

[1] https://lwn.net/Articles/747551/

Signed-off-by: Alexei Starovoitov <[email protected]>
---
fs/exec.c | 40 +++++++++++++++++++++++++++++++---------
include/linux/binfmts.h | 1 +
include/linux/umh.h | 3 +++
kernel/module.c | 43 ++++++++++++++++++++++++++++++++++++++-----
kernel/umh.c | 24 +++++++++++++++++++++---
5 files changed, 94 insertions(+), 17 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 7eb8d21bcab9..0483c438de7d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1691,14 +1691,13 @@ static int exec_binprm(struct linux_binprm *bprm)
/*
* sys_execve() executes a new program.
*/
-static int do_execveat_common(int fd, struct filename *filename,
- struct user_arg_ptr argv,
- struct user_arg_ptr envp,
- int flags)
+static int __do_execve_file(int fd, struct filename *filename,
+ struct user_arg_ptr argv,
+ struct user_arg_ptr envp,
+ int flags, struct file *file)
{
char *pathbuf = NULL;
struct linux_binprm *bprm;
- struct file *file;
struct files_struct *displaced;
int retval;

@@ -1737,7 +1736,8 @@ static int do_execveat_common(int fd, struct filename *filename,
check_unsafe_exec(bprm);
current->in_execve = 1;

- file = do_open_execat(fd, filename, flags);
+ if (!file)
+ file = do_open_execat(fd, filename, flags);
retval = PTR_ERR(file);
if (IS_ERR(file))
goto out_unmark;
@@ -1745,7 +1745,9 @@ static int do_execveat_common(int fd, struct filename *filename,
sched_exec();

bprm->file = file;
- if (fd == AT_FDCWD || filename->name[0] == '/') {
+ if (!filename) {
+ bprm->filename = "/dev/null";
+ } else if (fd == AT_FDCWD || filename->name[0] == '/') {
bprm->filename = filename->name;
} else {
if (filename->name[0] == '\0')
@@ -1811,7 +1813,8 @@ static int do_execveat_common(int fd, struct filename *filename,
task_numa_free(current);
free_bprm(bprm);
kfree(pathbuf);
- putname(filename);
+ if (filename)
+ putname(filename);
if (displaced)
put_files_struct(displaced);
return retval;
@@ -1834,10 +1837,29 @@ static int do_execveat_common(int fd, struct filename *filename,
if (displaced)
reset_files_struct(displaced);
out_ret:
- putname(filename);
+ if (filename)
+ putname(filename);
return retval;
}

+static int do_execveat_common(int fd, struct filename *filename,
+ struct user_arg_ptr argv,
+ struct user_arg_ptr envp,
+ int flags)
+{
+ struct file *file = NULL;
+
+ return __do_execve_file(fd, filename, argv, envp, flags, file);
+}
+
+int do_execve_file(struct file *file, void *__argv, void *__envp)
+{
+ struct user_arg_ptr argv = { .ptr.native = __argv };
+ struct user_arg_ptr envp = { .ptr.native = __envp };
+
+ return __do_execve_file(AT_FDCWD, NULL, argv, envp, 0, file);
+}
+
int do_execve(struct filename *filename,
const char __user *const __user *__argv,
const char __user *const __user *__envp)
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b0abe21d6cc9..c783a7b9f284 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -147,5 +147,6 @@ extern int do_execveat(int, struct filename *,
const char __user * const __user *,
const char __user * const __user *,
int);
+int do_execve_file(struct file *file, void *__argv, void *__envp);

#endif /* _LINUX_BINFMTS_H */
diff --git a/include/linux/umh.h b/include/linux/umh.h
index 244aff638220..2b10d5f70bd9 100644
--- a/include/linux/umh.h
+++ b/include/linux/umh.h
@@ -22,6 +22,7 @@ struct subprocess_info {
const char *path;
char **argv;
char **envp;
+ struct file *file;
int wait;
int retval;
int (*init)(struct subprocess_info *info, struct cred *new);
@@ -38,6 +39,8 @@ call_usermodehelper_setup(const char *path, char **argv, char **envp,
int (*init)(struct subprocess_info *info, struct cred *new),
void (*cleanup)(struct subprocess_info *), void *data);

+struct subprocess_info *call_usermodehelper_setup_file(struct file *file);
+
extern int
call_usermodehelper_exec(struct subprocess_info *info, int wait);

diff --git a/kernel/module.c b/kernel/module.c
index ad2d420024f6..6cfa35795741 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -325,6 +325,7 @@ struct load_info {
struct {
unsigned int sym, str, mod, vers, info, pcpu;
} index;
+ struct file *file;
};

/*
@@ -2801,6 +2802,15 @@ static int module_sig_check(struct load_info *info, int flags)
}
#endif /* !CONFIG_MODULE_SIG */

+static int run_umh(struct file *file)
+{
+ struct subprocess_info *sub_info = call_usermodehelper_setup_file(file);
+
+ if (!sub_info)
+ return -ENOMEM;
+ return call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);
+}
+
/* Sanity checks against invalid binaries, wrong arch, weird elf version. */
static int elf_header_check(struct load_info *info)
{
@@ -2808,7 +2818,6 @@ static int elf_header_check(struct load_info *info)
return -ENOEXEC;

if (memcmp(info->hdr->e_ident, ELFMAG, SELFMAG) != 0
- || info->hdr->e_type != ET_REL
|| !elf_check_arch(info->hdr)
|| info->hdr->e_shentsize != sizeof(Elf_Shdr))
return -ENOEXEC;
@@ -2818,6 +2827,11 @@ static int elf_header_check(struct load_info *info)
info->len - info->hdr->e_shoff))
return -ENOEXEC;

+ if (info->hdr->e_type == ET_EXEC)
+ return run_umh(info->file);
+
+ if (info->hdr->e_type != ET_REL)
+ return -ENOEXEC;
return 0;
}

@@ -3669,6 +3683,17 @@ static int load_module(struct load_info *info, const char __user *uargs,
if (err)
goto free_copy;

+ if (info->hdr->e_type == ET_EXEC) {
+#ifdef CONFIG_MODULE_SIG
+ if (!info->sig_ok) {
+ pr_notice_once("umh %s verification failed: signature and/or required key missing - tainting kernel\n",
+ info->file->f_path.dentry->d_name.name);
+ add_taint(TAINT_UNSIGNED_MODULE, LOCKDEP_STILL_OK);
+ }
+#endif
+ return 0;
+ }
+
/* Figure out module layout, and allocate all the memory. */
mod = layout_and_allocate(info, flags);
if (IS_ERR(mod)) {
@@ -3856,6 +3881,7 @@ SYSCALL_DEFINE3(init_module, void __user *, umod,
SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
{
struct load_info info = { };
+ struct fd f;
loff_t size;
void *hdr;
int err;
@@ -3870,14 +3896,21 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
|MODULE_INIT_IGNORE_VERMAGIC))
return -EINVAL;

- err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX,
- READING_MODULE);
+ f = fdget(fd);
+ if (!f.file)
+ return -EBADF;
+
+ err = kernel_read_file(f.file, &hdr, &size, INT_MAX, READING_MODULE);
if (err)
- return err;
+ goto out;
info.hdr = hdr;
info.len = size;
+ info.file = f.file;

- return load_module(&info, uargs, flags);
+ err = load_module(&info, uargs, flags);
+out:
+ fdput(f);
+ return err;
}

static inline int within(unsigned long addr, void *start, unsigned long size)
diff --git a/kernel/umh.c b/kernel/umh.c
index 18e5fa4b0e71..4361c694bdb1 100644
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -97,9 +97,13 @@ static int call_usermodehelper_exec_async(void *data)

commit_creds(new);

- retval = do_execve(getname_kernel(sub_info->path),
- (const char __user *const __user *)sub_info->argv,
- (const char __user *const __user *)sub_info->envp);
+ if (sub_info->file)
+ retval = do_execve_file(sub_info->file,
+ sub_info->argv, sub_info->envp);
+ else
+ retval = do_execve(getname_kernel(sub_info->path),
+ (const char __user *const __user *)sub_info->argv,
+ (const char __user *const __user *)sub_info->envp);
out:
sub_info->retval = retval;
/*
@@ -393,6 +397,20 @@ struct subprocess_info *call_usermodehelper_setup(const char *path, char **argv,
}
EXPORT_SYMBOL(call_usermodehelper_setup);

+struct subprocess_info *call_usermodehelper_setup_file(struct file *file)
+{
+ struct subprocess_info *sub_info;
+
+ sub_info = kzalloc(sizeof(struct subprocess_info), GFP_KERNEL);
+ if (!sub_info)
+ return NULL;
+
+ INIT_WORK(&sub_info->work, call_usermodehelper_exec_work);
+ sub_info->path = "/dev/null";
+ sub_info->file = file;
+ return sub_info;
+}
+
/**
* call_usermodehelper_exec - start a usermode application
* @sub_info: information about the subprocessa
--
2.9.5



2018-03-06 02:15:00

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

Hi,

On 03/05/2018 05:34 PM, Alexei Starovoitov wrote:

> diff --git a/kernel/module.c b/kernel/module.c
> index ad2d420024f6..6cfa35795741 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c

> @@ -3669,6 +3683,17 @@ static int load_module(struct load_info *info, const char __user *uargs,
> if (err)
> goto free_copy;
>
> + if (info->hdr->e_type == ET_EXEC) {
> +#ifdef CONFIG_MODULE_SIG
> + if (!info->sig_ok) {
> + pr_notice_once("umh %s verification failed: signature and/or required key missing - tainting kernel\n",

That's not a very friendly message to tell a user. "umh" eh?

> + info->file->f_path.dentry->d_name.name);
> + add_taint(TAINT_UNSIGNED_MODULE, LOCKDEP_STILL_OK);
> + }

And since the signature failed, why is it being loaded at all?
Is this in the "--force" load path?

> +#endif
> + return 0;
> + }
> +
> /* Figure out module layout, and allocate all the memory. */
> mod = layout_and_allocate(info, flags);
> if (IS_ERR(mod)) {

thanks,
--
~Randy

2018-03-06 03:04:31

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On 3/5/18 6:13 PM, Randy Dunlap wrote:
> Hi,
>
> On 03/05/2018 05:34 PM, Alexei Starovoitov wrote:
>
>> diff --git a/kernel/module.c b/kernel/module.c
>> index ad2d420024f6..6cfa35795741 100644
>> --- a/kernel/module.c
>> +++ b/kernel/module.c
>
>> @@ -3669,6 +3683,17 @@ static int load_module(struct load_info *info, const char __user *uargs,
>> if (err)
>> goto free_copy;
>>
>> + if (info->hdr->e_type == ET_EXEC) {
>> +#ifdef CONFIG_MODULE_SIG
>> + if (!info->sig_ok) {
>> + pr_notice_once("umh %s verification failed: signature and/or required key missing - tainting kernel\n",
>
> That's not a very friendly message to tell a user. "umh" eh?

umh is an abbreviation known to kernel newbies:
https://kernelnewbies.org/KernelProjects/usermode-helper-enhancements

The rest of the message is copy paste of existing one.

>> + info->file->f_path.dentry->d_name.name);
>> + add_taint(TAINT_UNSIGNED_MODULE, LOCKDEP_STILL_OK);
>> + }
>
> And since the signature failed, why is it being loaded at all?

because this is how regular kernel modules deal with it.
sig_enforce is handled earlier.

> Is this in the "--force" load path?

--force forces modver and modmagic. These things don't apply here.


2018-03-06 11:08:05

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Mon, Mar 05, 2018 at 05:34:57PM -0800, Alexei Starovoitov wrote:
> As the first step in development of bpfilter project [1] the request_module()
> code is extended to allow user mode helpers to be invoked. Idea is that
> user mode helpers are built as part of the kernel build and installed as
> traditional kernel modules with .ko file extension into distro specified
> location, such that from a distribution point of view, they are no different
> than regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) modules via:
>
> request_module("foo") ->
> call_umh("modprobe foo") ->
> sys_finit_module(FD of /lib/modules/.../foo.ko) ->
> call_umh(struct file)
>
> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root) and
> reduces security attack surface of such new code, meaning in case of
> potential bugs only the umh would crash but not the kernel. Another
> advantage coming with that would be that bpfilter.ko can be debugged and
> tested out of user space as well (e.g. opening the possibility to run
> all clang sanitizers, fuzzers or test suites for checking translation).
> Also, such architecture makes the kernel/user boundary very precise:
> control plane is done by the user space while data plane stays in the kernel.
>
> It's easy to distinguish "umh module" from traditional kernel module:
>
> $ readelf -h bld/net/bpfilter/bpfilter.ko|grep Type
> Type: EXEC (Executable file)
> $ readelf -h bld/net/ipv4/netfilter/iptable_filter.ko |grep Type
> Type: REL (Relocatable file)

Any chance you can add a field to your "umh module" type such that a
normal 'modinfo' program will be able to notice it is different easily?

> Since umh can crash, can be oom-ed by the kernel, killed by admin,
> the subsystem that uses them (like bpfilter) need to manage life
> time of umh on its own, so module infra doesn't do any accounting
> of them. They don't appear in "lsmod" and cannot be "rmmod".
> Multiple request_module("umh") will load multiple umh.ko processes.
>
> Similar to kernel modules the kernel will be tainted if "umh module"
> has invalid signature.

Shouldn't we fail to load the "module" if the signature is not valid if
CONFIG_MODULE_SIG_FORCE=y is enabled, like we do for modules? I run my
systems like that, and just "warning" isn't probably a good idea for
systems that want to enforce that everything is signed properly?

Other than that, one minor question:

> @@ -1745,7 +1745,9 @@ static int do_execveat_common(int fd, struct filename *filename,
> sched_exec();
>
> bprm->file = file;
> - if (fd == AT_FDCWD || filename->name[0] == '/') {
> + if (!filename) {
> + bprm->filename = "/dev/null";

Why the use of "/dev/null" for the filename here, and elsewhere in the
code? While I'm "sure" that everyone really does have /dev/null/
mounted in the root namespace, what is the use of it here?

Also, what "namespace" does this usermode helper run in? I'm guessing
the "root" one, which is fine with me, but note that people have
complained in the past about other UMH running in that namespace and not
in the specific namespace of the "container" that they wanted it to run
in.

Anyway, this is crazy stuff, but I like the idea and have no objection
to it overall :)

thanks,

greg k-h

2018-03-06 19:13:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov <[email protected]> wrote:
> As the first step in development of bpfilter project [1] the request_module()
> code is extended to allow user mode helpers to be invoked. Idea is that
> user mode helpers are built as part of the kernel build and installed as
> traditional kernel modules with .ko file extension into distro specified
> location, such that from a distribution point of view, they are no different
> than regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) modules via:
[,,]

I like this, but I have one request: can we make sure that this action
is visible in the system messages?

When we load a regular module, at least it shows in lsmod afterwards,
although I have a few times wanted to really see module load as an
event in the logs too.

When we load a module that just executes a user program, and there is
no sign of it in the module list, I think we *really* need to make
that event show to the admin some way.

.. and yes, maybe we'll need to rate-limit the messages, and maybe it
turns out that I'm entirely wrong and people will hate the messages
after they get used to the concept of these pseudo-modules, but
particularly for the early implementation when this is a new thing, I
really want a message like

executed user process xyz-abc as a pseudo-module

or something in dmesg.

I do *not* want this to be a magical way to hide things.

Linus

2018-03-06 20:03:51

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Tue, Mar 6, 2018 at 1:34 AM, Alexei Starovoitov <[email protected]> wrote:
> As the first step in development of bpfilter project [1] the request_module()
> code is extended to allow user mode helpers to be invoked. Idea is that
> user mode helpers are built as part of the kernel build and installed as
> traditional kernel modules with .ko file extension into distro specified
> location, such that from a distribution point of view, they are no different
> than regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) modules via:
>
> request_module("foo") ->
> call_umh("modprobe foo") ->
> sys_finit_module(FD of /lib/modules/.../foo.ko) ->
> call_umh(struct file)
>

I assume I'm missing some context here, but why does this need to be
handled by the kernel rather than, say, a change to how modprobe
works? I imagine that usermode tooling needs to change regardless
because the existing tools may get rather confused if a .ko "module"
is really a dynamically liked program. I notice that you're using
ET_EXEC in your example, and that will probably avoid problems, but I
imagine that some distros would much rather use ET_DYN.

2018-03-06 20:28:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Tue, Mar 6, 2018 at 12:01 PM, Andy Lutomirski <[email protected]> wrote:
>
> I assume I'm missing some context here, but why does this need to be
> handled by the kernel rather than, say, a change to how modprobe
> works?

Honestly, the less we have to mess with user-mode tooling, the better.

We've been *so* much better off moving most of the module loading
logic to the kernel, we should not go back in the old broken
direction.

I do *not* want the kmod project that is then taken over by systemd,
and breaks it the same way they broke firmware loading.

Keep modprobe doing one thing, and one thing only: track dependencies
and mindlessly just load the modules. Do *not* ask for it to do
anything else.

Right now kmod is a nice simple project. Lots of testsuite stuff, and
a very clear goal. Let's keep kmod doing one thing, and not even have
to care about internal kernel decisions like "oh, this module might
not be a module, but an executable".

If anything, I think we want to keep our options open, in the case we
need or want to ever consider short-circuiting things and allowing
direct loading of the simple cases and bypassing modprobe entirely.

Linus

2018-03-06 23:45:30

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On 6 Mar 2018, at 11:12, Linus Torvalds wrote:

> On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov <[email protected]>
> wrote:
>> As the first step in development of bpfilter project [1] the
>> request_module()
>> code is extended to allow user mode helpers to be invoked. Idea is
>> that
>> user mode helpers are built as part of the kernel build and installed
>> as
>> traditional kernel modules with .ko file extension into distro
>> specified
>> location, such that from a distribution point of view, they are no
>> different
>> than regular kernel modules. Thus, allow request_module() logic to
>> load such
>> user mode helper (umh) modules via:
> [,,]
>
> I like this, but I have one request: can we make sure that this action
> is visible in the system messages?
>
> When we load a regular module, at least it shows in lsmod afterwards,
> although I have a few times wanted to really see module load as an
> event in the logs too.
>
> When we load a module that just executes a user program, and there is
> no sign of it in the module list, I think we *really* need to make
> that event show to the admin some way.
>
> .. and yes, maybe we'll need to rate-limit the messages, and maybe it
> turns out that I'm entirely wrong and people will hate the messages
> after they get used to the concept of these pseudo-modules, but
> particularly for the early implementation when this is a new thing, I
> really want a message like
>
> executed user process xyz-abc as a pseudo-module
>
> or something in dmesg.
>
> I do *not* want this to be a magical way to hide things.

Especially early on, this makes a lot of sense. But I wanted to plug
bps and the hopefully growing set of bpf introspection tools:

https://github.com/iovisor/bcc/blob/master/introspection/bps_example.txt

Long term these are probably a good place to tell the admin what's going
on.

-chris

2018-03-07 01:16:50

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

combining multiple answers...

On 3/6/18 3:05 AM, Greg KH wrote:
>
> Any chance you can add a field to your "umh module" type such that a
> normal 'modinfo' program will be able to notice it is different easily?

ok. handling of modinfo turned out to be straightforward.
kmod tooling worked fine with simple addition of .modinfo section.

$ modinfo bpfilter
filename:
/lib/modules/4.16.0-rc4-00799-g1716f0aa3039-dirty/net/bpfilter/bpfilter.ko
umh: Y
license: GPL

I will require umh=Y and license to be present.
umh has to be set to Y for this 'umh modules'
and taint of kernel will happen if license is not gpl.
Other modinfo like vermagic are not applicable here, since
umh modules interact with kernel via normal kernel/user abi.

>> Since umh can crash, can be oom-ed by the kernel, killed by admin,
>> the subsystem that uses them (like bpfilter) need to manage life
>> time of umh on its own, so module infra doesn't do any accounting
>> of them. They don't appear in "lsmod" and cannot be "rmmod".
>> Multiple request_module("umh") will load multiple umh.ko processes.
>>
>> Similar to kernel modules the kernel will be tainted if "umh module"
>> has invalid signature.
>
> Shouldn't we fail to load the "module" if the signature is not valid if
> CONFIG_MODULE_SIG_FORCE=y is enabled, like we do for modules? I run my
> systems like that, and just "warning" isn't probably a good idea for
> systems that want to enforce that everything is signed properly?

CONFIG_MODULE_SIG_FORCE=y is already handled by this patch.
It's checked first for either .ko or umh.ko (before any elf parsing)
and returns -ENOKEY to user space without any dmesg message.
I think it's best to keep it as-is.
The taint and warning is for 'undef SIG_FORCE' and when module
is signed, but incorrectly.

>
> Other than that, one minor question:
>
>> @@ -1745,7 +1745,9 @@ static int do_execveat_common(int fd, struct filename *filename,
>> sched_exec();
>>
>> bprm->file = file;
>> - if (fd == AT_FDCWD || filename->name[0] == '/') {
>> + if (!filename) {
>> + bprm->filename = "/dev/null";
>
> Why the use of "/dev/null" for the filename here, and elsewhere in the
> code? While I'm "sure" that everyone really does have /dev/null/
> mounted in the root namespace, what is the use of it here?

filename is assumed to be non-null in several places further
down and instead of hacking it everywhere it's cleaner to assign
some string to it.
I'll change it to filename = "none"
Same in umh part.

> Also, what "namespace" does this usermode helper run in? I'm guessing
> the "root" one, which is fine with me, but note that people have
> complained in the past about other UMH running in that namespace and not
> in the specific namespace of the "container" that they wanted it to run
> in.

right. this is something we can tweak later if really necessary.
Right now most of the bpf is root-only, so bpfilter.ko would have to run
as cap_sys_admin for now. Later we plan to tighten it to be
cap_net_admin.


On 3/6/18 11:12 AM, Linus Torvalds wrote:
>
> particularly for the early implementation when this is a new thing, I
> really want a message like
>
> executed user process xyz-abc as a pseudo-module
>
> or something in dmesg.
>
> I do *not* want this to be a magical way to hide things.

right. no intent of hiding anything.
The first thing bpfilter.ko does is print 'Starting bpfilter'
into /dev/console.

Long term the health check of 'umh module' and interaction with
the kernel should be standardized and though they're normal processes
seen with 'ps' would be good to see them in lsmod as well.
For now it's indeed the best to do pr_warn() message like above.
Ratelimiting is probably not necessary.


On 3/6/18 12:01 PM, Andy Lutomirski wrote:
>
> I imagine that usermode tooling needs to change regardless
> because the existing tools may get rather confused if a .ko "module"

the goal is to do zero changes to user tooling.
The kmod tools handle this special .ko just fine.
Tested with modprobe, depmod, modinfo, insmod.
scripts/sign-file also works.


2018-03-07 03:25:25

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Tue, Mar 06, 2018 at 05:07:45PM -0800, Alexei Starovoitov wrote:
> combining multiple answers...
>
> On 3/6/18 3:05 AM, Greg KH wrote:
> >
> > Any chance you can add a field to your "umh module" type such that a
> > normal 'modinfo' program will be able to notice it is different easily?
>
> ok. handling of modinfo turned out to be straightforward.
> kmod tooling worked fine with simple addition of .modinfo section.
>
> $ modinfo bpfilter
> filename:
> /lib/modules/4.16.0-rc4-00799-g1716f0aa3039-dirty/net/bpfilter/bpfilter.ko
> umh: Y

Nice. But perhaps spell it out, "user_mode_helper"? Anyway,
bikesheding now, sorry, whatever you want to call it is fine with me.

> license: GPL
>
> I will require umh=Y and license to be present.
> umh has to be set to Y for this 'umh modules'
> and taint of kernel will happen if license is not gpl.

Interesting, I like it :)


> Other modinfo like vermagic are not applicable here, since
> umh modules interact with kernel via normal kernel/user abi.

Very true.

> > > Since umh can crash, can be oom-ed by the kernel, killed by admin,
> > > the subsystem that uses them (like bpfilter) need to manage life
> > > time of umh on its own, so module infra doesn't do any accounting
> > > of them. They don't appear in "lsmod" and cannot be "rmmod".
> > > Multiple request_module("umh") will load multiple umh.ko processes.
> > >
> > > Similar to kernel modules the kernel will be tainted if "umh module"
> > > has invalid signature.
> >
> > Shouldn't we fail to load the "module" if the signature is not valid if
> > CONFIG_MODULE_SIG_FORCE=y is enabled, like we do for modules? I run my
> > systems like that, and just "warning" isn't probably a good idea for
> > systems that want to enforce that everything is signed properly?
>
> CONFIG_MODULE_SIG_FORCE=y is already handled by this patch.
> It's checked first for either .ko or umh.ko (before any elf parsing)
> and returns -ENOKEY to user space without any dmesg message.
> I think it's best to keep it as-is.
> The taint and warning is for 'undef SIG_FORCE' and when module
> is signed, but incorrectly.

Ah, sorry, I missed that, thanks for clearing it up.

> > Other than that, one minor question:
> >
> > > @@ -1745,7 +1745,9 @@ static int do_execveat_common(int fd, struct filename *filename,
> > > sched_exec();
> > >
> > > bprm->file = file;
> > > - if (fd == AT_FDCWD || filename->name[0] == '/') {
> > > + if (!filename) {
> > > + bprm->filename = "/dev/null";
> >
> > Why the use of "/dev/null" for the filename here, and elsewhere in the
> > code? While I'm "sure" that everyone really does have /dev/null/
> > mounted in the root namespace, what is the use of it here?
>
> filename is assumed to be non-null in several places further
> down and instead of hacking it everywhere it's cleaner to assign
> some string to it.
> I'll change it to filename = "none"
> Same in umh part.

Thanks, that makes sense.

greg k-h

2018-03-07 17:24:03

by David Miller

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

From: Alexei Starovoitov <[email protected]>
Date: Mon, 5 Mar 2018 17:34:57 -0800

> As the first step in development of bpfilter project [1] the
> request_module() code is extended to allow user mode helpers to be
> invoked.

Looks like there is a little bit of feedback requiring changes, please
take care of that and resubmit.

Thanks!

2018-03-08 01:25:30

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Mon, Mar 05, 2018 at 05:34:57PM -0800, Alexei Starovoitov wrote:
> As the first step in development of bpfilter project [1]

So meta :) The URL refers an lwn article, which in turn refers to this effort's
first RFC. As someone only getting *one* of these patches in emails, It would
be useful if the cover letter referenced instead an optional git tree and
branch so one could easily get the patches for more careful inspection. In the
meantime, can you make such tree available with a branch?

Also, since kernel/module.c is involved it would be wise to include Jessica,
which I've Cc'd. I'm going to guess Kees may like to review this too. Mimi
might want to help review as well.

Rafael may care about suspend/resume implications of these "umh modules" as
you put it.

> the request_module()
> code is extended to allow user mode helpers to be invoked.

Upon inspection you never touch or use request_module() so this is
false and misleading. You don't use or extend request_module() at all, you rely
on extending finit_module() so that load_module() itself will now execute (via
modified do_execve_file()) the same file which was loaded as an module.

This is *very* different given request_module() has its own full magic and is
in and of itself a UMH of the kernel implemented in kernel/kmod.c.

Nevertheless, why?

> Idea is that
> user mode helpers are built as part of the kernel build and installed as
> traditional kernel modules with .ko file extension into distro specified
> location, such that from a distribution point of view, they are no different
> than regular kernel modules.

It sounds like finit_module() module loading is used as a convenience factor
simply to take advantage of being able to ship / maintain/ compile these umh
programs as part of the kernel. Is that right?

So the finit_module() interface was just a convenient mechanism. Is that right?

Ie, if folks had these binaries in place the regular UMH interface / API could be
used so that these could be looked for, but instead we want to carry these
in tandem with the kernel?

If so this still seems like an overly complex way to deal with this.

> Thus, allow request_module() logic to load such
> user mode helper (umh) modules via:
>
> request_module("foo") ->
> call_umh("modprobe foo") ->
> sys_finit_module(FD of /lib/modules/.../foo.ko) ->
> call_umh(struct file)

OK so the use case envisioned here was for networking code to do
something like:

if (!loaded) {
err = request_module("bpfilter");
...
}

This is visible on your third patch (this is from your RFC, not this series):

https://www.mail-archive.com/[email protected]/msg11129.html

So indeed all this patch does in the end is just putting tons of wrappers in
place so that kernel code can load certain trusted UMH programs we ship, and
maintain in the kernel.

request_module() has its own world though too. How often in your proof of
concept is request_module() called? How many times do you envision it being
called?

Please review lib/test_kmod.c and tools/testing/selftests/kmod/ for testing
your stuff too or consider extending appropriately.

Are aliases something which you expect we'll need to support for these
userspace... modules?

> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root) and
> reduces security attack surface of such new code, meaning in case of
> potential bugs only the umh would crash but not the kernel.

Now, this sounds great, however I think that the proof of concept chosen is
pretty complex to start off with. Even if its not designed to be a real
world life use case, a much simpler proof of concept to do something
more simple may be useful, if possible. One wouldn't need to to have it
replace a kernel functionality in real life. lib/ is full of CONFIG_TEST_*
examples, a simple new stupid kernel functionality which can in turn be replaced
with a respective userspace counterpart may be useful, and both kconfig
entries would be mutually exclusive.

> Another
> advantage coming with that would be that bpfilter.ko

You mean foo.ko

> can be debugged and
> tested out of user space as well (e.g. opening the possibility to run
> all clang sanitizers, fuzzers or test suites for checking translation).

Great too.

> Also, such architecture makes the kernel/user boundary very precise:
> control plane is done by the user space while data plane stays in the kernel.

I don't see how this is defining any boundary, I see just a loader for a
userspace program, and re-using a kernel interface known, finit_module() which
makes it convenient for us to load pre-compiled kernel junk. I'm still
not convinced this is the right approach.

> It's easy to distinguish "umh module" from traditional kernel module:

Ah you said it, "umh module". I don't see what makes it a "umh module" so far,
all we are doing is executing a userspace program a la UMH. But its a
synchronous call, so we wait. The only modular thing here I see is we took the
finit_module() to be able to loader programs compiled in the kernel. Its not
using existing kernel symbols, its not exported symbols, etc.

> $ readelf -h bld/net/bpfilter/bpfilter.ko|grep Type
> Type: EXEC (Executable file)
> $ readelf -h bld/net/ipv4/netfilter/iptable_filter.ko |grep Type
> Type: REL (Relocatable file)
>
> Since umh can crash, can be oom-ed by the kernel, killed by admin,
> the subsystem that uses them (like bpfilter) need to manage life
> time of umh on its own, so module infra doesn't do any accounting
> of them.

Trying to avoid the "module infra" in theory sounds great, however you
did extend it a bit.

Also, the umh has its own infrastructure which I've slowly also been trying to
groom and cleanup with a few wtf's still left in place and which if we don't
take care, extending this use could backfire in other ways. Feedback
below.

> They don't appear in "lsmod" and cannot be "rmmod".
> Multiple request_module("umh") will load multiple umh.ko processes.

That also means we have a namespace collision between these programs
and kernel modules. Does compilation warn if we hit a conflict here
already? What about with aliases?

> Similar to kernel modules the kernel will be tainted if "umh module"
> has invalid signature.
>
> [1] https://lwn.net/Articles/747551/
>
> Signed-off-by: Alexei Starovoitov <[email protected]>
> ---
> fs/exec.c | 40 +++++++++++++++++++++++++++++++---------
> include/linux/binfmts.h | 1 +
> include/linux/umh.h | 3 +++
> kernel/module.c | 43 ++++++++++++++++++++++++++++++++++++++-----
> kernel/umh.c | 24 +++++++++++++++++++++---
> 5 files changed, 94 insertions(+), 17 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 7eb8d21bcab9..0483c438de7d 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1691,14 +1691,13 @@ static int exec_binprm(struct linux_binprm *bprm)
> /*
> * sys_execve() executes a new program.
> */
> -static int do_execveat_common(int fd, struct filename *filename,
> - struct user_arg_ptr argv,
> - struct user_arg_ptr envp,
> - int flags)
> +static int __do_execve_file(int fd, struct filename *filename,
> + struct user_arg_ptr argv,
> + struct user_arg_ptr envp,
> + int flags, struct file *file)
> {
> char *pathbuf = NULL;
> struct linux_binprm *bprm;
> - struct file *file;
> struct files_struct *displaced;
> int retval;
>
> @@ -1737,7 +1736,8 @@ static int do_execveat_common(int fd, struct filename *filename,
> check_unsafe_exec(bprm);
> current->in_execve = 1;
>
> - file = do_open_execat(fd, filename, flags);
> + if (!file)
> + file = do_open_execat(fd, filename, flags);
> retval = PTR_ERR(file);
> if (IS_ERR(file))
> goto out_unmark;
> @@ -1745,7 +1745,9 @@ static int do_execveat_common(int fd, struct filename *filename,
> sched_exec();
>
> bprm->file = file;
> - if (fd == AT_FDCWD || filename->name[0] == '/') {
> + if (!filename) {
> + bprm->filename = "/dev/null";
> + } else if (fd == AT_FDCWD || filename->name[0] == '/') {
> bprm->filename = filename->name;
> } else {
> if (filename->name[0] == '\0')
> @@ -1811,7 +1813,8 @@ static int do_execveat_common(int fd, struct filename *filename,
> task_numa_free(current);
> free_bprm(bprm);
> kfree(pathbuf);
> - putname(filename);
> + if (filename)
> + putname(filename);
> if (displaced)
> put_files_struct(displaced);
> return retval;
> @@ -1834,10 +1837,29 @@ static int do_execveat_common(int fd, struct filename *filename,
> if (displaced)
> reset_files_struct(displaced);
> out_ret:
> - putname(filename);
> + if (filename)
> + putname(filename);
> return retval;
> }
>
> +static int do_execveat_common(int fd, struct filename *filename,
> + struct user_arg_ptr argv,
> + struct user_arg_ptr envp,
> + int flags)
> +{
> + struct file *file = NULL;
> +
> + return __do_execve_file(fd, filename, argv, envp, flags, file);
> +}
> +
> +int do_execve_file(struct file *file, void *__argv, void *__envp)
> +{
> + struct user_arg_ptr argv = { .ptr.native = __argv };
> + struct user_arg_ptr envp = { .ptr.native = __envp };
> +
> + return __do_execve_file(AT_FDCWD, NULL, argv, envp, 0, file);
> +}
> +
> int do_execve(struct filename *filename,
> const char __user *const __user *__argv,
> const char __user *const __user *__envp)
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index b0abe21d6cc9..c783a7b9f284 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -147,5 +147,6 @@ extern int do_execveat(int, struct filename *,
> const char __user * const __user *,
> const char __user * const __user *,
> int);
> +int do_execve_file(struct file *file, void *__argv, void *__envp);
>
> #endif /* _LINUX_BINFMTS_H */
> diff --git a/include/linux/umh.h b/include/linux/umh.h
> index 244aff638220..2b10d5f70bd9 100644
> --- a/include/linux/umh.h
> +++ b/include/linux/umh.h
> @@ -22,6 +22,7 @@ struct subprocess_info {
> const char *path;
> char **argv;
> char **envp;
> + struct file *file;
> int wait;
> int retval;
> int (*init)(struct subprocess_info *info, struct cred *new);
> @@ -38,6 +39,8 @@ call_usermodehelper_setup(const char *path, char **argv, char **envp,
> int (*init)(struct subprocess_info *info, struct cred *new),
> void (*cleanup)(struct subprocess_info *), void *data);
>
> +struct subprocess_info *call_usermodehelper_setup_file(struct file *file);
> +
> extern int
> call_usermodehelper_exec(struct subprocess_info *info, int wait);
>
> diff --git a/kernel/module.c b/kernel/module.c
> index ad2d420024f6..6cfa35795741 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -325,6 +325,7 @@ struct load_info {
> struct {
> unsigned int sym, str, mod, vers, info, pcpu;
> } index;
> + struct file *file;
> };
>
> /*
> @@ -2801,6 +2802,15 @@ static int module_sig_check(struct load_info *info, int flags)
> }
> #endif /* !CONFIG_MODULE_SIG */
>
> +static int run_umh(struct file *file)
> +{
> + struct subprocess_info *sub_info = call_usermodehelper_setup_file(file);
> +
> + if (!sub_info)
> + return -ENOMEM;
> + return call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);
> +}

run_umh() calls the program and waits. Note that while we are running a UMH we
can't suspend. What if they take forever, who is hosing them down with an
equivalent:

schedule();
try_to_freeze();

Say they are buggy and never return, will they stall suspend, have you tried?

call_usermodehelper_exec() uses helper_lock() which in turn is used for
umh's accounting for number of running umh's. One of the sad obscure uses
for this is to deal with suspend/resume. Refer to __usermodehelper* calls
on kernel/power/process.c

Note how you use UMH_WAIT_EXEC too, so this is all synchronous.

How many of these calls do you expect to be flying through? Is there any
chance a system could want to keep spawning a lot of these without any
chance of them yielding a bit?

I'm not a fan of this use of the UMH lock thing, but the reason for it was back
in the day today's firmware fallback mechanism used to be the default firmware
loading interface, and it relied on a sysfs loading interface, and *with* this
UMH locking thing we prevented system stalls on suspend/resume by first
checking if they were supposed to run or not. But *only* the firmware loader
interface uses the UMH lock API calls:

o usermodehelper_read_lock_wait()
o usermodehelper_read_trylock()

A long term goal which I had recently was to actually get rid of these stupid
calls at least out of the way from the direct firmware lookup calls since no
UMH was involved and fortunately we fast-paced that that upstream [0], refer to
"devil is in the details".

[0] https://patchwork.kernel.org/patch/9949775/

But -- other than not stalling suspend, an other implicit reason for it, was in
turn the assumption that the files would be available whenever the callers (in
this case the firmware API) needed them, and then there could be a race with
the UMH callers and any respective subsystem filesystem which provides the files
going down. This is a *theoretical* consideration any other UMH caller needs to
make today.

This later race is addressed today with the firmware cache implementation, for
direct firmware filesystem lookups.

And yes, if you are trying to poke around files from the filesystem and are
suspending/resuming you may just not get access to some of them today, I just
dealt with one case recently reported [1]. So the firmware cache is *still*
required today.

Seeing any new broad user of of the UMH gives me concern if the above lessons
are not taken into consideration. Do we *need* these files to be sure to be
present during suspend/resume? How are we sure we won't race with
suspend/resume?

What if the same file is loaded multiple times, why not have one shared file
pointer, and have all users use that while such requests are in place? The
firmware loader has this in place with "batched requests".

[1] https://lkml.kernel.org/r/[email protected]

If it sounds like the direct firmware loading interface is more robust, and
reliable, and tested to load files its because that is exactly what it was
designed to do. The kernel UMH is a simple interface with limited use and I'd
caution further *extensive* uses of it.

> +
> /* Sanity checks against invalid binaries, wrong arch, weird elf version. */
> static int elf_header_check(struct load_info *info)
> {
> @@ -2808,7 +2818,6 @@ static int elf_header_check(struct load_info *info)
> return -ENOEXEC;
>
> if (memcmp(info->hdr->e_ident, ELFMAG, SELFMAG) != 0
> - || info->hdr->e_type != ET_REL
> || !elf_check_arch(info->hdr)
> || info->hdr->e_shentsize != sizeof(Elf_Shdr))
> return -ENOEXEC;
> @@ -2818,6 +2827,11 @@ static int elf_header_check(struct load_info *info)
> info->len - info->hdr->e_shoff))
> return -ENOEXEC;
>
> + if (info->hdr->e_type == ET_EXEC)
> + return run_umh(info->file);

Note you run_umh() on elf_header_check().

> +
> + if (info->hdr->e_type != ET_REL)
> + return -ENOEXEC;
> return 0;
> }
>
> @@ -3669,6 +3683,17 @@ static int load_module(struct load_info *info, const char __user *uargs,

The missing context line below is:

err = elf_header_check(info);
> if (err)
> goto free_copy;
>
> + if (info->hdr->e_type == ET_EXEC) {
> +#ifdef CONFIG_MODULE_SIG
> + if (!info->sig_ok) {
> + pr_notice_once("umh %s verification failed: signature and/or required key missing - tainting kernel\n",
> + info->file->f_path.dentry->d_name.name);
> + add_taint(TAINT_UNSIGNED_MODULE, LOCKDEP_STILL_OK);
> + }
> +#endif

So I guess this check is done *after* run_umh() then, what about the enforce mode,
don't we want to reject loading at all in any circumstance?

> + return 0;
> + }
> +
> /* Figure out module layout, and allocate all the memory. */
> mod = layout_and_allocate(info, flags);
> if (IS_ERR(mod)) {
> @@ -3856,6 +3881,7 @@ SYSCALL_DEFINE3(init_module, void __user *, umod,
> SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
> {
> struct load_info info = { };
> + struct fd f;
> loff_t size;
> void *hdr;
> int err;
> @@ -3870,14 +3896,21 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
> |MODULE_INIT_IGNORE_VERMAGIC))
> return -EINVAL;
>
> - err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX,
> - READING_MODULE);
> + f = fdget(fd);
> + if (!f.file)
> + return -EBADF;
> +
> + err = kernel_read_file(f.file, &hdr, &size, INT_MAX, READING_MODULE);
> if (err)

I wonder if from an LSM perspective security_kernel_read_file() and
READING_MODULE will suffice or if we want to have a different identifier here.

Luis

> - return err;
> + goto out;
> info.hdr = hdr;
> info.len = size;
> + info.file = f.file;
>
> - return load_module(&info, uargs, flags);
> + err = load_module(&info, uargs, flags);
> +out:
> + fdput(f);
> + return err;
> }
>
> static inline int within(unsigned long addr, void *start, unsigned long size)
> diff --git a/kernel/umh.c b/kernel/umh.c
> index 18e5fa4b0e71..4361c694bdb1 100644
> --- a/kernel/umh.c
> +++ b/kernel/umh.c
> @@ -97,9 +97,13 @@ static int call_usermodehelper_exec_async(void *data)
>
> commit_creds(new);
>
> - retval = do_execve(getname_kernel(sub_info->path),
> - (const char __user *const __user *)sub_info->argv,
> - (const char __user *const __user *)sub_info->envp);
> + if (sub_info->file)
> + retval = do_execve_file(sub_info->file,
> + sub_info->argv, sub_info->envp);
> + else
> + retval = do_execve(getname_kernel(sub_info->path),
> + (const char __user *const __user *)sub_info->argv,
> + (const char __user *const __user *)sub_info->envp);
> out:
> sub_info->retval = retval;
> /*
> @@ -393,6 +397,20 @@ struct subprocess_info *call_usermodehelper_setup(const char *path, char **argv,
> }
> EXPORT_SYMBOL(call_usermodehelper_setup);
>
> +struct subprocess_info *call_usermodehelper_setup_file(struct file *file)
> +{
> + struct subprocess_info *sub_info;
> +
> + sub_info = kzalloc(sizeof(struct subprocess_info), GFP_KERNEL);
> + if (!sub_info)
> + return NULL;
> +
> + INIT_WORK(&sub_info->work, call_usermodehelper_exec_work);
> + sub_info->path = "/dev/null";
> + sub_info->file = file;
> + return sub_info;
> +}
> +
> /**
> * call_usermodehelper_exec - start a usermode application
> * @sub_info: information about the subprocessa
> --
> 2.9.5
>
>

--
Luis Rodriguez, SUSE LINUX GmbH
Maxfeldstrasse 5; D-90409 Nuernberg

2018-03-08 23:08:46

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On 3/7/18 5:23 PM, Luis R. Rodriguez wrote:
>
> request_module() has its own world though too. How often in your proof of
> concept is request_module() called? How many times do you envision it being
> called?

once.

>> +static int run_umh(struct file *file)
>> +{
>> + struct subprocess_info *sub_info = call_usermodehelper_setup_file(file);
>> +
>> + if (!sub_info)
>> + return -ENOMEM;
>> + return call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);
>> +}
>
> run_umh() calls the program and waits. Note that while we are running a UMH we
> can't suspend. What if they take forever, who is hosing them down with an
> equivalent:
>
> schedule();
> try_to_freeze();
>
> Say they are buggy and never return, will they stall suspend, have you tried?
>
> call_usermodehelper_exec() uses helper_lock() which in turn is used for
> umh's accounting for number of running umh's. One of the sad obscure uses
> for this is to deal with suspend/resume. Refer to __usermodehelper* calls
> on kernel/power/process.c
>
> Note how you use UMH_WAIT_EXEC too, so this is all synchronous.

looks like you misread this code and the rest of your concerns
with suspend/resume are not applicable any more.

#define UMH_NO_WAIT 0 /* don't wait at all */
#define UMH_WAIT_EXEC 1 /* wait for the exec, but not the process */
#define UMH_WAIT_PROC 2 /* wait for the process to complete */
#define UMH_KILLABLE 4 /* wait for EXEC/PROC killable */

bpftiler.ko is started once and runs non-stop from there on.
Unless it crashes, but it's a different discussion.

>> + if (info->hdr->e_type == ET_EXEC) {
>> +#ifdef CONFIG_MODULE_SIG
>> + if (!info->sig_ok) {
>> + pr_notice_once("umh %s verification failed: signature and/or required key missing - tainting kernel\n",
>> + info->file->f_path.dentry->d_name.name);
>> + add_taint(TAINT_UNSIGNED_MODULE, LOCKDEP_STILL_OK);
>> + }
>> +#endif
>
> So I guess this check is done *after* run_umh() then, what about the enforce mode,
> don't we want to reject loading at all in any circumstance?

already answered this twice in the thread.


2018-03-09 00:27:04

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

How is this not marked [RFC]? :)

On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov <[email protected]> wrote:
> As the first step in development of bpfilter project [1] the request_module()
> code is extended to allow user mode helpers to be invoked. Idea is that
> user mode helpers are built as part of the kernel build and installed as
> traditional kernel modules with .ko file extension into distro specified
> location, such that from a distribution point of view, they are no different
> than regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) modules via:
>
> request_module("foo") ->
> call_umh("modprobe foo") ->
> sys_finit_module(FD of /lib/modules/.../foo.ko) ->
> call_umh(struct file)

Yikes. This means autoloaded artbitrary exec() now, instead of
autoloading just loading arbitrary modules? That seems extremely bad.
This just extends all the problems we've had with defining security
boundaries with modules out to umh too. I would need some major
convincing that this can be made safe.

It's easy for container to trigger arbitrary module loading, so if it
can find/use a similar one of the bugs we've seen in the past to
redirect the module loading we don't just get new module-created
attack surface added to the kernel, but we again get arbitrary ELF
running in the root ns (not in the container). And in the insane case
of a container with CAP_SYS_MODULE, this is a trivial escape without
even needing to build a special kernel module: the root ns, running
with all privileges runs an arbitrary ELF.

As it stands, I have to NAK this. At the very least, you need to solve
the execution environment problems here: the ELF should run with no
greater privileges than what loaded the module, and very importantly,
must not be allowed to bypass these checks through autoloading. What
_triggered_ the autoload must be the environment, not the "modprobe",
since that's running with full privileges. Basically, we need to fix
module autoloading first, or at least a significant subset of the
problems: https://lkml.org/lkml/2017/11/27/754

> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root) and
> reduces security attack surface of such new code, meaning in case of
> potential bugs only the umh would crash but not the kernel. Another
> advantage coming with that would be that bpfilter.ko can be debugged and
> tested out of user space as well (e.g. opening the possibility to run
> all clang sanitizers, fuzzers or test suites for checking translation).
> Also, such architecture makes the kernel/user boundary very precise:
> control plane is done by the user space while data plane stays in the kernel.
>
> It's easy to distinguish "umh module" from traditional kernel module:
>
> $ readelf -h bld/net/bpfilter/bpfilter.ko|grep Type
> Type: EXEC (Executable file)

As Andy asked earlier, why not DYN too to catch PIE executables? Seems
like forcing the userspace helper to be non-PIE would defeat some of
the userspace defenses in use in most distros.

> $ readelf -h bld/net/ipv4/netfilter/iptable_filter.ko |grep Type
> Type: REL (Relocatable file)
>
> Since umh can crash, can be oom-ed by the kernel, killed by admin,
> the subsystem that uses them (like bpfilter) need to manage life
> time of umh on its own, so module infra doesn't do any accounting
> of them. They don't appear in "lsmod" and cannot be "rmmod".
> Multiple request_module("umh") will load multiple umh.ko processes.
>
> Similar to kernel modules the kernel will be tainted if "umh module"
> has invalid signature.
>
> [1] https://lwn.net/Articles/747551/
>
> Signed-off-by: Alexei Starovoitov <[email protected]>
> ---
> fs/exec.c | 40 +++++++++++++++++++++++++++++++---------
> include/linux/binfmts.h | 1 +
> include/linux/umh.h | 3 +++
> kernel/module.c | 43 ++++++++++++++++++++++++++++++++++++++-----
> kernel/umh.c | 24 +++++++++++++++++++++---
> 5 files changed, 94 insertions(+), 17 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 7eb8d21bcab9..0483c438de7d 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1691,14 +1691,13 @@ static int exec_binprm(struct linux_binprm *bprm)
> /*
> * sys_execve() executes a new program.
> */
> -static int do_execveat_common(int fd, struct filename *filename,
> - struct user_arg_ptr argv,
> - struct user_arg_ptr envp,
> - int flags)
> +static int __do_execve_file(int fd, struct filename *filename,
> + struct user_arg_ptr argv,
> + struct user_arg_ptr envp,
> + int flags, struct file *file)
> {
> char *pathbuf = NULL;
> struct linux_binprm *bprm;
> - struct file *file;
> struct files_struct *displaced;
> int retval;
>
> @@ -1737,7 +1736,8 @@ static int do_execveat_common(int fd, struct filename *filename,
> check_unsafe_exec(bprm);
> current->in_execve = 1;
>
> - file = do_open_execat(fd, filename, flags);
> + if (!file)
> + file = do_open_execat(fd, filename, flags);
> retval = PTR_ERR(file);
> if (IS_ERR(file))
> goto out_unmark;
> @@ -1745,7 +1745,9 @@ static int do_execveat_common(int fd, struct filename *filename,
> sched_exec();
>
> bprm->file = file;
> - if (fd == AT_FDCWD || filename->name[0] == '/') {
> + if (!filename) {
> + bprm->filename = "/dev/null";
> + } else if (fd == AT_FDCWD || filename->name[0] == '/') {
> bprm->filename = filename->name;
> } else {
> if (filename->name[0] == '\0')
> @@ -1811,7 +1813,8 @@ static int do_execveat_common(int fd, struct filename *filename,
> task_numa_free(current);
> free_bprm(bprm);
> kfree(pathbuf);
> - putname(filename);
> + if (filename)
> + putname(filename);
> if (displaced)
> put_files_struct(displaced);
> return retval;
> @@ -1834,10 +1837,29 @@ static int do_execveat_common(int fd, struct filename *filename,
> if (displaced)
> reset_files_struct(displaced);
> out_ret:
> - putname(filename);
> + if (filename)
> + putname(filename);
> return retval;
> }
>
> +static int do_execveat_common(int fd, struct filename *filename,
> + struct user_arg_ptr argv,
> + struct user_arg_ptr envp,
> + int flags)
> +{
> + struct file *file = NULL;
> +
> + return __do_execve_file(fd, filename, argv, envp, flags, file);
> +}
> +
> +int do_execve_file(struct file *file, void *__argv, void *__envp)
> +{
> + struct user_arg_ptr argv = { .ptr.native = __argv };
> + struct user_arg_ptr envp = { .ptr.native = __envp };
> +
> + return __do_execve_file(AT_FDCWD, NULL, argv, envp, 0, file);
> +}

The exec.c changes should be split into a separate patch. Something
feels incorrectly refactored here, though. Passing all three of fd,
filename, and file to __do_execve_file() seems wrong; perhaps the fd
to file handling needs to happen externally in what you have here as
do_execveat_common()? The resulting __do_execve_file() has repeated
conditionals on filename... maybe what I object to is being able to
pass a NULL filename at all. The filename can be (painfully)
reconstructed from the struct file.

> [...]
> diff --git a/kernel/module.c b/kernel/module.c
> index ad2d420024f6..6cfa35795741 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> [...]
> @@ -3870,14 +3896,21 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
> |MODULE_INIT_IGNORE_VERMAGIC))
> return -EINVAL;
>
> - err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX,
> - READING_MODULE);
> + f = fdget(fd);
> + if (!f.file)
> + return -EBADF;
> +
> + err = kernel_read_file(f.file, &hdr, &size, INT_MAX, READING_MODULE);

For the LSM subsystem, I think this should also get it's own enum
kernel_read_file_id. This is really no longer a kernel module...

> if (err)
> - return err;
> + goto out;
> info.hdr = hdr;
> info.len = size;
> + info.file = f.file;
>
> - return load_module(&info, uargs, flags);
> + err = load_module(&info, uargs, flags);
> +out:
> + fdput(f);
> + return err;
> }
>
> static inline int within(unsigned long addr, void *start, unsigned long size)
> diff --git a/kernel/umh.c b/kernel/umh.c
> index 18e5fa4b0e71..4361c694bdb1 100644
> --- a/kernel/umh.c
> +++ b/kernel/umh.c
> @@ -97,9 +97,13 @@ static int call_usermodehelper_exec_async(void *data)
>
> commit_creds(new);
>
> - retval = do_execve(getname_kernel(sub_info->path),
> - (const char __user *const __user *)sub_info->argv,
> - (const char __user *const __user *)sub_info->envp);
> + if (sub_info->file)
> + retval = do_execve_file(sub_info->file,
> + sub_info->argv, sub_info->envp);
> + else
> + retval = do_execve(getname_kernel(sub_info->path),
> + (const char __user *const __user *)sub_info->argv,
> + (const char __user *const __user *)sub_info->envp);
> out:
> sub_info->retval = retval;
> /*
> @@ -393,6 +397,20 @@ struct subprocess_info *call_usermodehelper_setup(const char *path, char **argv,
> }
> EXPORT_SYMBOL(call_usermodehelper_setup);
>
> +struct subprocess_info *call_usermodehelper_setup_file(struct file *file)
> +{
> + struct subprocess_info *sub_info;
> +
> + sub_info = kzalloc(sizeof(struct subprocess_info), GFP_KERNEL);
> + if (!sub_info)
> + return NULL;
> +
> + INIT_WORK(&sub_info->work, call_usermodehelper_exec_work);
> + sub_info->path = "/dev/null";
> + sub_info->file = file;

This use of "/dev/null" here and in execve is just wrong. It _does_
have a path and filename...

Anyway, interesting idea. I think it _can_ work, it just needs much
much more careful security boundaries and to solve our autoloading
exposures too.

-Kees

--
Kees Cook
Pixel Security

2018-03-09 00:59:15

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On 3/8/18 4:24 PM, Kees Cook wrote:
> How is this not marked [RFC]? :)
>
> On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov <[email protected]> wrote:
>> As the first step in development of bpfilter project [1] the request_module()
>> code is extended to allow user mode helpers to be invoked. Idea is that
>> user mode helpers are built as part of the kernel build and installed as
>> traditional kernel modules with .ko file extension into distro specified
>> location, such that from a distribution point of view, they are no different
>> than regular kernel modules. Thus, allow request_module() logic to load such
>> user mode helper (umh) modules via:
>>
>> request_module("foo") ->
>> call_umh("modprobe foo") ->
>> sys_finit_module(FD of /lib/modules/.../foo.ko) ->
>> call_umh(struct file)
>
> Yikes. This means autoloaded artbitrary exec() now, instead of
> autoloading just loading arbitrary modules? That seems extremely bad.
> This just extends all the problems we've had with defining security
> boundaries with modules out to umh too. I would need some major
> convincing that this can be made safe.
>
> It's easy for container to trigger arbitrary module loading, so if it
> can find/use a similar one of the bugs we've seen in the past to
> redirect the module loading we don't just get new module-created
> attack surface added to the kernel, but we again get arbitrary ELF
> running in the root ns (not in the container). And in the insane case
> of a container with CAP_SYS_MODULE, this is a trivial escape without
> even needing to build a special kernel module: the root ns, running
> with all privileges runs an arbitrary ELF.
>
> As it stands, I have to NAK this. At the very least, you need to solve
> the execution environment problems here: the ELF should run with no
> greater privileges than what loaded the module, and very importantly,
> must not be allowed to bypass these checks through autoloading. What
> _triggered_ the autoload must be the environment, not the "modprobe",
> since that's running with full privileges. Basically, we need to fix
> module autoloading first, or at least a significant subset of the
> problems: https://lkml.org/lkml/2017/11/27/754

The above are three paragraphs of security paranoia without single
concrete example of a security issue.

>> Such approach enables kernel to delegate functionality traditionally done
>> by kernel modules into user space processes (either root or !root) and
>> reduces security attack surface of such new code, meaning in case of
>> potential bugs only the umh would crash but not the kernel. Another
>> advantage coming with that would be that bpfilter.ko can be debugged and
>> tested out of user space as well (e.g. opening the possibility to run
>> all clang sanitizers, fuzzers or test suites for checking translation).
>> Also, such architecture makes the kernel/user boundary very precise:
>> control plane is done by the user space while data plane stays in the kernel.
>>
>> It's easy to distinguish "umh module" from traditional kernel module:
>>
>> $ readelf -h bld/net/bpfilter/bpfilter.ko|grep Type
>> Type: EXEC (Executable file)
>
> As Andy asked earlier, why not DYN too to catch PIE executables? Seems
> like forcing the userspace helper to be non-PIE would defeat some of
> the userspace defenses in use in most distros.

because we don't add features without concrete users.

>> $ readelf -h bld/net/ipv4/netfilter/iptable_filter.ko |grep Type
>> Type: REL (Relocatable file)
>>
>> Since umh can crash, can be oom-ed by the kernel, killed by admin,
>> the subsystem that uses them (like bpfilter) need to manage life
>> time of umh on its own, so module infra doesn't do any accounting
>> of them. They don't appear in "lsmod" and cannot be "rmmod".
>> Multiple request_module("umh") will load multiple umh.ko processes.
>>
>> Similar to kernel modules the kernel will be tainted if "umh module"
>> has invalid signature.
>>
>> [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lwn.net_Articles_747551_&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=pMlEM-qKOmGljadUKAdwKBpuYHQRXzApvSGkZFIT4Gg&s=0-vP6LH-GxXnL7LuV-KfMl1NbqsVJUINSygoVa9R6nk&e=
>>
>> Signed-off-by: Alexei Starovoitov <[email protected]>
>> ---
>> fs/exec.c | 40 +++++++++++++++++++++++++++++++---------
>> include/linux/binfmts.h | 1 +
>> include/linux/umh.h | 3 +++
>> kernel/module.c | 43 ++++++++++++++++++++++++++++++++++++++-----
>> kernel/umh.c | 24 +++++++++++++++++++++---
>> 5 files changed, 94 insertions(+), 17 deletions(-)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 7eb8d21bcab9..0483c438de7d 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1691,14 +1691,13 @@ static int exec_binprm(struct linux_binprm *bprm)
>> /*
>> * sys_execve() executes a new program.
>> */
>> -static int do_execveat_common(int fd, struct filename *filename,
>> - struct user_arg_ptr argv,
>> - struct user_arg_ptr envp,
>> - int flags)
>> +static int __do_execve_file(int fd, struct filename *filename,
>> + struct user_arg_ptr argv,
>> + struct user_arg_ptr envp,
>> + int flags, struct file *file)
>> {
>> char *pathbuf = NULL;
>> struct linux_binprm *bprm;
>> - struct file *file;
>> struct files_struct *displaced;
>> int retval;
>>
>> @@ -1737,7 +1736,8 @@ static int do_execveat_common(int fd, struct filename *filename,
>> check_unsafe_exec(bprm);
>> current->in_execve = 1;
>>
>> - file = do_open_execat(fd, filename, flags);
>> + if (!file)
>> + file = do_open_execat(fd, filename, flags);
>> retval = PTR_ERR(file);
>> if (IS_ERR(file))
>> goto out_unmark;
>> @@ -1745,7 +1745,9 @@ static int do_execveat_common(int fd, struct filename *filename,
>> sched_exec();
>>
>> bprm->file = file;
>> - if (fd == AT_FDCWD || filename->name[0] == '/') {
>> + if (!filename) {
>> + bprm->filename = "/dev/null";
>> + } else if (fd == AT_FDCWD || filename->name[0] == '/') {
>> bprm->filename = filename->name;
>> } else {
>> if (filename->name[0] == '\0')
>> @@ -1811,7 +1813,8 @@ static int do_execveat_common(int fd, struct filename *filename,
>> task_numa_free(current);
>> free_bprm(bprm);
>> kfree(pathbuf);
>> - putname(filename);
>> + if (filename)
>> + putname(filename);
>> if (displaced)
>> put_files_struct(displaced);
>> return retval;
>> @@ -1834,10 +1837,29 @@ static int do_execveat_common(int fd, struct filename *filename,
>> if (displaced)
>> reset_files_struct(displaced);
>> out_ret:
>> - putname(filename);
>> + if (filename)
>> + putname(filename);
>> return retval;
>> }
>>
>> +static int do_execveat_common(int fd, struct filename *filename,
>> + struct user_arg_ptr argv,
>> + struct user_arg_ptr envp,
>> + int flags)
>> +{
>> + struct file *file = NULL;
>> +
>> + return __do_execve_file(fd, filename, argv, envp, flags, file);
>> +}
>> +
>> +int do_execve_file(struct file *file, void *__argv, void *__envp)
>> +{
>> + struct user_arg_ptr argv = { .ptr.native = __argv };
>> + struct user_arg_ptr envp = { .ptr.native = __envp };
>> +
>> + return __do_execve_file(AT_FDCWD, NULL, argv, envp, 0, file);
>> +}
>
> The exec.c changes should be split into a separate patch. Something
> feels incorrectly refactored here, though. Passing all three of fd,
> filename, and file to __do_execve_file() seems wrong; perhaps the fd
> to file handling needs to happen externally in what you have here as
> do_execveat_common()? The resulting __do_execve_file() has repeated
> conditionals on filename... maybe what I object to is being able to
> pass a NULL filename at all. The filename can be (painfully)
> reconstructed from the struct file.

reconstruct the path and instantly introduce the race between execing
something by path vs prior check that it's actual elf of already
opened file ?!
excellent suggestion to improve security.

>> [...]
>> diff --git a/kernel/module.c b/kernel/module.c
>> index ad2d420024f6..6cfa35795741 100644
>> --- a/kernel/module.c
>> +++ b/kernel/module.c
>> [...]
>> @@ -3870,14 +3896,21 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
>> |MODULE_INIT_IGNORE_VERMAGIC))
>> return -EINVAL;
>>
>> - err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX,
>> - READING_MODULE);
>> + f = fdget(fd);
>> + if (!f.file)
>> + return -EBADF;
>> +
>> + err = kernel_read_file(f.file, &hdr, &size, INT_MAX, READING_MODULE);
>
> For the LSM subsystem, I think this should also get it's own enum
> kernel_read_file_id. This is really no longer a kernel module...

at this point it's a _file_. It could have been text file just well.
If lsm is thinking that at this point kernel is processing
kernel module that lsm is badly broken.

>> if (err)
>> - return err;
>> + goto out;
>> info.hdr = hdr;
>> info.len = size;
>> + info.file = f.file;
>>
>> - return load_module(&info, uargs, flags);
>> + err = load_module(&info, uargs, flags);
>> +out:
>> + fdput(f);
>> + return err;
>> }
>>
>> static inline int within(unsigned long addr, void *start, unsigned long size)
>> diff --git a/kernel/umh.c b/kernel/umh.c
>> index 18e5fa4b0e71..4361c694bdb1 100644
>> --- a/kernel/umh.c
>> +++ b/kernel/umh.c
>> @@ -97,9 +97,13 @@ static int call_usermodehelper_exec_async(void *data)
>>
>> commit_creds(new);
>>
>> - retval = do_execve(getname_kernel(sub_info->path),
>> - (const char __user *const __user *)sub_info->argv,
>> - (const char __user *const __user *)sub_info->envp);
>> + if (sub_info->file)
>> + retval = do_execve_file(sub_info->file,
>> + sub_info->argv, sub_info->envp);
>> + else
>> + retval = do_execve(getname_kernel(sub_info->path),
>> + (const char __user *const __user *)sub_info->argv,
>> + (const char __user *const __user *)sub_info->envp);
>> out:
>> sub_info->retval = retval;
>> /*
>> @@ -393,6 +397,20 @@ struct subprocess_info *call_usermodehelper_setup(const char *path, char **argv,
>> }
>> EXPORT_SYMBOL(call_usermodehelper_setup);
>>
>> +struct subprocess_info *call_usermodehelper_setup_file(struct file *file)
>> +{
>> + struct subprocess_info *sub_info;
>> +
>> + sub_info = kzalloc(sizeof(struct subprocess_info), GFP_KERNEL);
>> + if (!sub_info)
>> + return NULL;
>> +
>> + INIT_WORK(&sub_info->work, call_usermodehelper_exec_work);
>> + sub_info->path = "/dev/null";
>> + sub_info->file = file;
>
> This use of "/dev/null" here and in execve is just wrong. It _does_
> have a path and filename...

already answered that earlier in the thread.


2018-03-09 01:01:11

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 9, 2018 at 12:24 AM, Kees Cook <[email protected]> wrote:
> How is this not marked [RFC]? :)
>
> On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov <[email protected]> wrote:
>> As the first step in development of bpfilter project [1] the request_module()
>> code is extended to allow user mode helpers to be invoked. Idea is that
>> user mode helpers are built as part of the kernel build and installed as
>> traditional kernel modules with .ko file extension into distro specified
>> location, such that from a distribution point of view, they are no different
>> than regular kernel modules. Thus, allow request_module() logic to load such
>> user mode helper (umh) modules via:
>>
>> request_module("foo") ->
>> call_umh("modprobe foo") ->
>> sys_finit_module(FD of /lib/modules/.../foo.ko) ->
>> call_umh(struct file)
>
> Yikes. This means autoloaded artbitrary exec() now, instead of
> autoloading just loading arbitrary modules? That seems extremely bad.
> This just extends all the problems we've had with defining security
> boundaries with modules out to umh too. I would need some major
> convincing that this can be made safe.
>
> It's easy for container to trigger arbitrary module loading, so if it
> can find/use a similar one of the bugs we've seen in the past to
> redirect the module loading we don't just get new module-created
> attack surface added to the kernel, but we again get arbitrary ELF
> running in the root ns (not in the container). And in the insane case
> of a container with CAP_SYS_MODULE, this is a trivial escape without
> even needing to build a special kernel module: the root ns, running
> with all privileges runs an arbitrary ELF.
>
> As it stands, I have to NAK this. At the very least, you need to solve
> the execution environment problems here: the ELF should run with no
> greater privileges than what loaded the module, and very importantly,
> must not be allowed to bypass these checks through autoloading. What
> _triggered_ the autoload must be the environment, not the "modprobe",
> since that's running with full privileges. Basically, we need to fix
> module autoloading first, or at least a significant subset of the
> problems: https://lkml.org/lkml/2017/11/27/754

I disagree. If we're going to somehow have ELF binaries that pretend
to be modules, then they should run with high privilege and, more
importantly, should run in a context independent of whoever triggered
autoloading.

Also, I don't see how this is any more exploitable than any other
init_module(). If someone has CAP_SYS_MODULE or autoload privileges
and they can trigger init_module() on a file, then they're granting
maximum privilege to the contents of that file. So who cares if that
file is a kernel module or an ELF binary?

Alexei, can you give an example use case? I'm sure it's upthread
somewhere, but I'm having trouble finding it.

Also, I just tested this concept a bit. Depmod invoked explicitly on
an ET_EXEC with a.ko extension gets mad, but depmod -a on a kernel
that has a "module" like that seems to work fine. Go figure.

2018-03-09 01:06:09

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 9, 2018 at 12:57 AM, Alexei Starovoitov <[email protected]> wrote:
> On 3/8/18 4:24 PM, Kees Cook wrote:
>>
>> As Andy asked earlier, why not DYN too to catch PIE executables? Seems
>> like forcing the userspace helper to be non-PIE would defeat some of
>> the userspace defenses in use in most distros.
>
>
> because we don't add features without concrete users.

I disagree here. If you're going to add a magic trick that triggers
an execve(), then I think you should either support *both* standard,
widely-used types of ELF programs or you should give a compelling use
case that works for ET_EXEC but not for ET_DYN. Keep in mind that
many distros have a very strong preference for ET_DYN.

Or you could argue that ET_DYN requires tooling changes, but I think
it's awkward to ask the tooling to change in advance of the kernel
being willing to actually invoke the thing. I'm not actually
convinced that any tooling changes would be needed, though.

2018-03-09 01:22:10

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 09, 2018 at 12:59:36AM +0000, Andy Lutomirski wrote:
>
> Alexei, can you give an example use case? I'm sure it's upthread
> somewhere, but I'm having trouble finding it.

at the time of iptable's setsockopt() the kernel will do
err = request_module("bpfilter");
once.
The rough POC code:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/ipv4/bpfilter/sockopt.c?h=ipt_bpf#n25

> Also, I just tested this concept a bit. Depmod invoked explicitly on
> an ET_EXEC with a.ko extension gets mad, but depmod -a on a kernel
> that has a "module" like that seems to work fine. Go figure.

right. that's with the current patch.
In v2 I require .modinfo section to make sure license is specified,
but depmod still not very happy:
$ depmod /lib/modules/`uname -r`/kernel/net/bpfilter/bpfilter.ko
depmod: ERROR: Bad version passed /lib/modules/4.16.0-rc4-00799-g1716f0aa3039-dirty/kernel/net/bpfilter/bpfilter.ko
I'm not sure it's worth to silence it, since
as you noticed 'depmod -a' works.


2018-03-09 01:25:23

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Thu, Mar 8, 2018 at 4:57 PM, Alexei Starovoitov <[email protected]> wrote:
> The above are three paragraphs of security paranoia without single
> concrete example of a security issue.

How is running an arbitrary ELF as full init_ns root from a container
not a concrete example?

I'm not saying this approach can never happen. And this isn't
paranoia. There are very real security boundary violations in this
model, IMO. Though, as Andy says, it's more about autoloading than umh
itself. I just don't want to extend that problem further.

>> As Andy asked earlier, why not DYN too to catch PIE executables? Seems
>> like forcing the userspace helper to be non-PIE would defeat some of
>> the userspace defenses in use in most distros.
>
>
> because we don't add features without concrete users.

It's just a two-line change, and then it would work on distros that
build PIE-by-default. That seems like a concrete use-case.

>> The exec.c changes should be split into a separate patch. Something
>> feels incorrectly refactored here, though. Passing all three of fd,
>> filename, and file to __do_execve_file() seems wrong; perhaps the fd
>> to file handling needs to happen externally in what you have here as
>> do_execveat_common()? The resulting __do_execve_file() has repeated
>> conditionals on filename... maybe what I object to is being able to
>> pass a NULL filename at all. The filename can be (painfully)
>> reconstructed from the struct file.
>
> reconstruct the path and instantly introduce the race between execing
> something by path vs prior check that it's actual elf of already
> opened file ?!
> excellent suggestion to improve security.

I'm not sure why you're being so hostile. We're both interested in
improving the kernel.

Path names aren't stable, but they provide good _debugging_ about
things when needed. For example, the LSM hooks, if they report on one
of these loads, can produce a hint to the user about what happened.
Passing "/dev/null" around is confusing at the very least. The ELF is
absolutely not /dev/null. Why make someone guess?

>>> [...]
>>> diff --git a/kernel/module.c b/kernel/module.c
>>> index ad2d420024f6..6cfa35795741 100644
>>> --- a/kernel/module.c
>>> +++ b/kernel/module.c
>>> [...]
>>> @@ -3870,14 +3896,21 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char
>>> __user *, uargs, int, flags)
>>> |MODULE_INIT_IGNORE_VERMAGIC))
>>> return -EINVAL;
>>>
>>> - err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX,
>>> - READING_MODULE);
>>> + f = fdget(fd);
>>> + if (!f.file)
>>> + return -EBADF;
>>> +
>>> + err = kernel_read_file(f.file, &hdr, &size, INT_MAX,
>>> READING_MODULE);
>>
>>
>> For the LSM subsystem, I think this should also get it's own enum
>> kernel_read_file_id. This is really no longer a kernel module...
>
>
> at this point it's a _file_. It could have been text file just well.
> If lsm is thinking that at this point kernel is processing
> kernel module that lsm is badly broken.

Again, this is about making things more discoverable. We already know
if we're loading a kernel module or a umh ELF. Passing this
information to the LSM is valuable to the LSM to distinguish between
types of files. They have very different effects on the system, for
example. The issue is about validating intent with target. "Is this
kernel module allowed?" and "Is this umh ELF allowed?" are better
questions that "Is something loaded through finit_module() allowed?"
Note already that the LSM distinguishes between many other file types
in kernel_read_file*().

-Kees

--
Kees Cook
Pixel Security

2018-03-09 01:27:09

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 09, 2018 at 01:04:39AM +0000, Andy Lutomirski wrote:
> On Fri, Mar 9, 2018 at 12:57 AM, Alexei Starovoitov <[email protected]> wrote:
> > On 3/8/18 4:24 PM, Kees Cook wrote:
> >>
> >> As Andy asked earlier, why not DYN too to catch PIE executables? Seems
> >> like forcing the userspace helper to be non-PIE would defeat some of
> >> the userspace defenses in use in most distros.
> >
> >
> > because we don't add features without concrete users.
>
> I disagree here. If you're going to add a magic trick that triggers
> an execve(), then I think you should either support *both* standard,
> widely-used types of ELF programs or you should give a compelling use
> case that works for ET_EXEC but not for ET_DYN. Keep in mind that
> many distros have a very strong preference for ET_DYN.

misunderstanding here is that this bpfiler.ko is part of _kernel build_.
Kernel decides how it's build.
Nothing to do with distros.
Current Makefile is very dumb and it's built with HOSTCC:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/bpfilter/Makefile?h=ipt_bpf
but it will be standalone with CC before final to make sure crosscompiling works.


2018-03-09 01:39:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Thu, Mar 8, 2018 at 4:59 PM, Andy Lutomirski <[email protected]> wrote:
>
> Also, I don't see how this is any more exploitable than any other
> init_module().

Absolutely. If Kees doesn't trust the files to be loaded, an
executable - even if it's running with root privileges and in the
initns - is still fundamentally weaker than a kernel module.

So I don't understand the security argument AT ALL. It's nonsensical.
The executable loading does all the same security checks that the
module loading does, including the signing check.

And the whole point is that we can now do things with building and
loading a ebpf rule instead of having a full module.

Linus

2018-03-09 01:45:30

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Thu, Mar 8, 2018 at 5:38 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Mar 8, 2018 at 4:59 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> Also, I don't see how this is any more exploitable than any other
>> init_module().
>
> Absolutely. If Kees doesn't trust the files to be loaded, an
> executable - even if it's running with root privileges and in the
> initns - is still fundamentally weaker than a kernel module.
>
> So I don't understand the security argument AT ALL. It's nonsensical.
> The executable loading does all the same security checks that the
> module loading does, including the signing check.
>
> And the whole point is that we can now do things with building and
> loading a ebpf rule instead of having a full module.

My concerns are mostly about crossing namespaces. If a container
triggers an autoload, the result runs in the init_ns. So, really,
there's nothing new from my perspective, except that it's in userspace
instead of in the kernel.

Perhaps it's an orthogonal concern.

-Kees

--
Kees Cook
Pixel Security

2018-03-09 01:59:41

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Thu, Mar 08, 2018 at 03:07:01PM -0800, Alexei Starovoitov wrote:
> On 3/7/18 5:23 PM, Luis R. Rodriguez wrote:
> >
> > request_module() has its own world though too. How often in your proof of
> > concept is request_module() called? How many times do you envision it being
> > called?
>
> once.

What about other users later in the kernel?

> > > +static int run_umh(struct file *file)
> > > +{
> > > + struct subprocess_info *sub_info = call_usermodehelper_setup_file(file);
> > > +
> > > + if (!sub_info)
> > > + return -ENOMEM;
> > > + return call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);
> > > +}
> >
> > run_umh() calls the program and waits. Note that while we are running a UMH we
> > can't suspend. What if they take forever, who is hosing them down with an
> > equivalent:
> >
> > schedule();
> > try_to_freeze();
> >
> > Say they are buggy and never return, will they stall suspend, have you tried?
> >
> > call_usermodehelper_exec() uses helper_lock() which in turn is used for
> > umh's accounting for number of running umh's. One of the sad obscure uses
> > for this is to deal with suspend/resume. Refer to __usermodehelper* calls
> > on kernel/power/process.c
> >
> > Note how you use UMH_WAIT_EXEC too, so this is all synchronous.
>
> looks like you misread this code
>
> #define UMH_NO_WAIT 0 /* don't wait at all */
> #define UMH_WAIT_EXEC 1 /* wait for the exec, but not the process */
> #define UMH_WAIT_PROC 2 /* wait for the process to complete */
> #define UMH_KILLABLE 4 /* wait for EXEC/PROC killable */

I certainly did get the incorrect impression this was a sync op, sorry
about that. In that case its odd to see a request_module() call, when
what is really meant is more of a request_module_nowait(), its also
UMH_NO_WAIT on the modprobe call itself as well.

In fact I think I'd much prefer at the very least to see a different wrapper
for this, even if its using the same bolts and nuts underneath for userspace
processes compiled on the kernel. The sanity checks in place for
request_module() may change later and this use case seems rather simple and
limited. Keeping tabs on this type of users seems desirable. At the *very
least*

diff --git a/include/linux/kmod.h b/include/linux/kmod.h
index 40c89ad4bea6..7530e00da03b 100644
--- a/include/linux/kmod.h
+++ b/include/linux/kmod.h
@@ -37,11 +37,13 @@ extern __printf(2, 3)
int __request_module(bool wait, const char *name, ...);
#define request_module(mod...) __request_module(true, mod)
#define request_module_nowait(mod...) __request_module(false, mod)
+#define request_umh_proc(mod...) __request_module(false, mod)
#define try_then_request_module(x, mod...) \
((x) ?: (__request_module(true, mod), (x)))
#else
static inline int request_module(const char *name, ...) { return -ENOSYS; }
static inline int request_module_nowait(const char *name, ...) { return -ENOSYS; }
+static inline int request_umh_proc(const char *name, ...) { return -ENOSYS; }
#define try_then_request_module(x, mod...) (x)
#endif

Modulo, kernel/umh.c is already its own thing, so pick a name that helps
identify these things clearly.

> and the rest of your concerns with suspend/resume are not applicable any
> more.

Agreed, except it does still mean these userspace processes are limited to
execution only the kernel is up, and not going to suspend, but I think that's
a proper sanity check by the umh API.

> bpftiler.ko is started once and runs non-stop from there on.
> Unless it crashes, but it's a different discussion.

Sure.

Luis

2018-03-09 02:14:15

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 9, 2018 at 1:20 AM, Alexei Starovoitov
<[email protected]> wrote:
> On Fri, Mar 09, 2018 at 12:59:36AM +0000, Andy Lutomirski wrote:
>>
>> Alexei, can you give an example use case? I'm sure it's upthread
>> somewhere, but I'm having trouble finding it.
>
> at the time of iptable's setsockopt() the kernel will do
> err = request_module("bpfilter");
> once.
> The rough POC code:
> https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/ipv4/bpfilter/sockopt.c?h=ipt_bpf#n25

Here's what I gather from reading that code: you have a new kernel
feature (consisting of actual kernel code) that wants to defer some of
its implementation to user mode. I like this idea a lot. But I have
a suggestion for a slightly different way of accomplishing the same
thing. Rather than extending init_module() to accept ELF input,
except the call_umh code to be able to call blobs. You'd use it it
very roughly like this:

First, compile your user code and emit a staitc binary. Use objdump
fiddling or a trivial .S file to make that static binary into a
variable. Then write a tiny shim module like this:

extern unsigned char __begin_user_code[], __end_user_code[];

int __init init_shim_module(void)
{
return call_umh_blob(__begin_user_code, __end_user_code - __begin_user_code);
}

By itself, this is clearly a worse solution than yours, but it has two
benefits, one small and two big. The small benefit is that it is
completely invisible to userspace: the .ko file is a bona fide module.
The big benefits are:

1. It works even in a non-modular kernel! (Okay, it probably only
works if you can arrange for the built-in module to be initialized
late enough, but that's straightforward.)

2. It allows future extensions to change the way the glue works. For
example, maybe you want the module to integrate properly with lsmod,
etc. Rather than adding a mechanism for general privileged programs
to register themselves with lsmod (ick!), you could do it entirely in
the kernel where lsmod would know that a particular umh task is
special. More usefully, you could extend call_umh_blob() to pass in
some pre-initialized struct files, which would give a clean way to
*synchronously* create a communication channel to user code for
whatever service the user code provides. And it would be more
straightforward to make the umh blob do what it needs to do without
relying on any particular filesystems being mounted.

I think we don't want to end up in a situation where we ship a program
with a .ko extension that opens something in /dev, for example.

call_umh_blob() would create an anon_inode or similar object backed by
the blob and exec it.

2018-03-09 02:33:40

by David Miller

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

From: Andy Lutomirski <[email protected]>
Date: Fri, 9 Mar 2018 02:12:24 +0000

> First, compile your user code and emit a staitc binary. Use objdump
> fiddling or a trivial .S file to make that static binary into a
> variable. Then write a tiny shim module like this:
>
> extern unsigned char __begin_user_code[], __end_user_code[];
>
> int __init init_shim_module(void)
> {
> return call_umh_blob(__begin_user_code, __end_user_code - __begin_user_code);
> }
>
> By itself, this is clearly a worse solution than yours, but it has two
> benefits, one small and two big. The small benefit is that it is
> completely invisible to userspace: the .ko file is a bona fide module.

Anything you try to do which makes these binaries "special" is a huge
negative.

> The big benefits are:

I don't see those things as benefits at all, and Alexei's scheme can
easily be made to work in your benefit #1 case too.

It's a user binary. It's shipped with the kernel and it's signed.

If we can't trust that, we can't trust much else.

And this whole container argument.. It's a mirage.

Kernel modules are 1000 times worse, since they can access any
container and any namespace they want.

2018-03-09 03:07:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Thu, Mar 8, 2018 at 5:44 PM, Kees Cook <[email protected]> wrote:
>
> My concerns are mostly about crossing namespaces. If a container
> triggers an autoload, the result runs in the init_ns.

Heh. I saw that as an advantage. It's basically the same semantics as
a normal module load does - in that the "kernel namespace" is init_ns.

My own personal worry is actually different - we do check the
signature of the file we're loading, but we're then passing it off to
execve() not as the image we loaded, but as the file pointer.

So the execve() will end up not using the actual buffer we checked the
signature on, but instead just re-reading the file.

Now, that has two issues:

(a) it means that 'init_module' doesn't work, only 'finit_module'.

Not a big deal, but I do think that we should fail the 'info->file
== NULL' case explicitly (instead of failing when it does an execve()
of /dev/null, which is what I think it does now - but that's just from
the reading the patch, not from testing it).

(b) somebody could maybe try to time it and modify the file
after-the-fact of the signature check, and then we execute something
else.

Honestly, that "read twice" thing may be what scuttles this.
Initially, I thought it was a non-issue, because anybody who controls
the module subdirectory enough to rewrite files would be in a position
to just execute the file itself directly instead.

But it turns out that isn't needed. Some bad actor could just do
"finit_module()" with a file that they just *copied* from the module
directory.

Yes, yes, they still need CAP_SYS_MODULE to even get that far, but it
does worry me.

So this does seem to turn "CAP_SYS_MODULE" into meaning "can run any
executable as root in the init namespace". They don't have to have the
module signing key that I thought would protect us, because they can
do that "rewrite after signature check".

So that does seem like a bad problem with the approach.

So I don't like Andy's "let's make it a kernel module and then that
kernel module can execve() a blob". THAT seems like just stupid
indirection.

But I do like Andy's "execve a blob" part, because it is the *blob*
that has had its signature verified, not the file!

Linus

2018-03-09 03:12:24

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries




> On Mar 8, 2018, at 6:31 PM, David Miller <[email protected]> wrote:
>
> From: Andy Lutomirski <[email protected]>
> Date: Fri, 9 Mar 2018 02:12:24 +0000
>
>> First, compile your user code and emit a staitc binary. Use objdump
>> fiddling or a trivial .S file to make that static binary into a
>> variable. Then write a tiny shim module like this:
>>
>> extern unsigned char __begin_user_code[], __end_user_code[];
>>
>> int __init init_shim_module(void)
>> {
>> return call_umh_blob(__begin_user_code, __end_user_code - __begin_user_code);
>> }
>>
>> By itself, this is clearly a worse solution than yours, but it has two
>> benefits, one small and two big. The small benefit is that it is
>> completely invisible to userspace: the .ko file is a bona fide module.
>
> Anything you try to do which makes these binaries "special" is a huge
> negative.

I don’t know what you mean. Alexei’s approach introduces a whole new kind of special module. Mine doesn’t.

>
>> The big benefits are:
>
> I don't see those things as benefits at all, and Alexei's scheme can
> easily be made to work in your benefit #1 case too.
>

How? I think you’ll find that a non-modular implementation of a bundled ELF binary looks a *lot* like my call_umh_blob().

> It's a user binary. It's shipped with the kernel and it's signed.
>
> If we can't trust that, we can't trust much else.

I’m not making any arguments about security at all. I’m talking about functionality.

If we apply Alexei’s patch as is, then I think we’ll have a situation where ET_EXEC modules are only useful if they can do their jobs without any filesystem access at all. This is fine for networking, where netlink sockets are used, but I think it’s not so great for other use cases. If we ever try to stick a usb driver into userspace, we’re going to want to instantiate the user task once per device, passed as stdin or similar, and Alexei’s code will make that very awkward.


2018-03-09 03:19:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Thu, Mar 8, 2018 at 7:06 PM, Linus Torvalds
<[email protected]> wrote:
>
> So I don't like Andy's "let's make it a kernel module and then that
> kernel module can execve() a blob". THAT seems like just stupid
> indirection.
>
> But I do like Andy's "execve a blob" part, because it is the *blob*
> that has had its signature verified, not the file!

To be honest., Andy's approach has the advantage that all the
synchronization we do for modules still works.

In particular, we have module loading logic where we synchronize
loading of modules with the same name, so that two things that do
request_module() concurrently will not lead to two modules being
loaded.

See that whole "module_mutex" thing, and the logic in (for example)
"add_unformed_module()".

Andrei's patch short-circuits the module loading before that, so if
you have two concurrent request_module() calls, you'll end up with no
serialization, and two executables.

So maybe Andy is right. He often is, after all.

Linus

2018-03-09 03:28:41

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 09, 2018 at 02:12:24AM +0000, Andy Lutomirski wrote:
> On Fri, Mar 9, 2018 at 1:20 AM, Alexei Starovoitov
> <[email protected]> wrote:
> > On Fri, Mar 09, 2018 at 12:59:36AM +0000, Andy Lutomirski wrote:
> >>
> >> Alexei, can you give an example use case? I'm sure it's upthread
> >> somewhere, but I'm having trouble finding it.
> >
> > at the time of iptable's setsockopt() the kernel will do
> > err = request_module("bpfilter");
> > once.
> > The rough POC code:
> > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/ipv4/bpfilter/sockopt.c?h=ipt_bpf#n25
>
> Here's what I gather from reading that code: you have a new kernel
> feature (consisting of actual kernel code) that wants to defer some of
> its implementation to user mode. I like this idea a lot. But I have
> a suggestion for a slightly different way of accomplishing the same
> thing. Rather than extending init_module() to accept ELF input,
> except the call_umh code to be able to call blobs. You'd use it it
> very roughly like this:
>
> First, compile your user code and emit a staitc binary. Use objdump
> fiddling or a trivial .S file to make that static binary into a
> variable. Then write a tiny shim module like this:
>
> extern unsigned char __begin_user_code[], __end_user_code[];
>
> int __init init_shim_module(void)
> {
> return call_umh_blob(__begin_user_code, __end_user_code - __begin_user_code);
> }
>
> By itself, this is clearly a worse solution than yours, but it has two
> benefits, one small and two big. The small benefit is that it is
> completely invisible to userspace: the .ko file is a bona fide module.

Unfortunately it's not quite the case.
The normal .ko that does call_umh_blob is indeed seen in lsmod,
but the umh process is a separate task.
It could have been oomed or killed by admin and
this normal .ko wouldn't notice it, so health check of umh process
by the kernel is needed regardless.
Right now bpfilter has trivial fuse-like protocol. This part
is still to be designed cleanly.

No doubt that visibility and debuggability into this umh processes
is must have, but lsmod/rmmod interface doesn't quite fit.
As you said letting this priv tasks register themselves in lsmod
is certainly no-go.
I think if they will be in lsmod, kernel has to register them
and establish health check with umh at the same time.
I think worrying about restarting is not necessary.
This is still kernel code with the same high standards and
review process. If they crash it's really a kernel bug.
It only doesn't take the system down.

> I think we don't want to end up in a situation where we ship a program
> with a .ko extension that opens something in /dev, for example.

this part I don't get. What's wrong with open of /dev ?
I don't see a use case for it, but technically why not?

> call_umh_blob() would create an anon_inode or similar object backed by
> the blob and exec it.

Interesting. I haven't considered such approach.

For full context it all started from the idea of 'unprivileged kernel modules'
or 'hardened kernel modules'. Something that kernel can easily
interact with, but much safer than traditional kernel module.

I've tried a bunch of crappy ideas first:
1. have a piece of kernel .text vm_mmap-ed into user process that
doing iptables setsockopt and on return to user space force handle_signal
to execute that code. Sort of like forced ld_preload where parasite
code is provided by the kernel but runs in user space
2. have a special set of kernel page tables in read-only mode while
iptable->bpf conversion is happening
3. have load_module() fork a user task and load real kernel .ko into it

trying to hack #3 realized that I'm mainly copy-pasting a lot of
load_elf_binary() code without elf_interpreter bits,
so figured it's much easier and simpler to blend sys_finit_module
with load_elf_binary via tweaking do_execveat_common and keeping
that .ko as normal elf which is implemented in this patch.
Debugging of that fake .ko is so much better.
If it's done via call_umh_blob() the process that starts
will indeed be a user mode process, but you won't be able to
attach gdb to it. Whereas in this patch it's normal elf and
standard debugging techniques apply. A developer can
do gdb breakpoints, debug info, etc.
That is huge advantage of keeping it as normal elf.


2018-03-09 03:56:07

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries




> On Mar 8, 2018, at 7:06 PM, Linus Torvalds <[email protected]> wrote:
>
>
> Honestly, that "read twice" thing may be what scuttles this.
> Initially, I thought it was a non-issue, because anybody who controls
> the module subdirectory enough to rewrite files would be in a position
> to just execute the file itself directly instead.
>

On further consideration, I think there’s another showstopper. This patch is a potentially severe ABI break. Right now, loading a module *copies* it into memory and does not hold a reference to the underlying fs. With the patch applied, all kinds of use cases can break in gnarly ways. Initramfs is maybe okay, but initrd may be screwed. If you load an ET_EXEC module from initrd, then umount it, then clear the ramdisk, something will go horribly wrong. Exactly what goes wrong depends on whether userspace notices that umount() failed. Similarly, if you load one of these modules over a network and then lose your connection, you have a problem.


The “read twice” thing is also bad for another reason: containers. Suppose I have a setup where a container can load a signed module blob. With the read twice code, the container can race and run an entirely different blob outside the container.

2018-03-09 05:10:25

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On 3/8/18 7:54 PM, Andy Lutomirski wrote:
>
>
>
>> On Mar 8, 2018, at 7:06 PM, Linus Torvalds <[email protected]> wrote:
>>
>>
>> Honestly, that "read twice" thing may be what scuttles this.
>> Initially, I thought it was a non-issue, because anybody who controls
>> the module subdirectory enough to rewrite files would be in a position
>> to just execute the file itself directly instead.
>>
>
> On further consideration, I think there’s another showstopper. This patch is a potentially severe ABI break. Right now, loading a module *copies* it into memory and does not hold a reference to the underlying fs. With the patch applied, all kinds of use cases can break in gnarly ways. Initramfs is maybe okay, but initrd may be screwed. If you load an ET_EXEC module from initrd, then umount it, then clear the ramdisk, something will go horribly wrong. Exactly what goes wrong depends on whether userspace notices that umount() failed. Similarly, if you load one of these modules over a network and then lose your connection, you have a problem.

there is not abi breakage and file cannot disappear from running task.
One cannot umount fs while file is still being used.

>
> The “read twice” thing is also bad for another reason: containers. Suppose I have a setup where a container can load a signed module blob. With the read twice code, the container can race and run an entirely different blob outside the container.

Not only "read twice", but "read many".
If .text sections of elf that are not yet in memory can be modified
by malicious user, later they will be brought in with different code.
I think the easiest fix to tighten this "umh modules" to CAP_SYS_ADMIN.


2018-03-09 15:17:57

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

>> On Mar 8, 2018, at 9:08 PM, Alexei Starovoitov <[email protected]> wrote:
>>
>> On 3/8/18 7:54 PM, Andy Lutomirski wrote:
>>
>>
>>
>>> On Mar 8, 2018, at 7:06 PM, Linus Torvalds <[email protected]> wrote:
>>>
>>>
>>> Honestly, that "read twice" thing may be what scuttles this.
>>> Initially, I thought it was a non-issue, because anybody who controls
>>> the module subdirectory enough to rewrite files would be in a position
>>> to just execute the file itself directly instead.
>>
>> On further consideration, I think there’s another showstopper. This patch is a potentially severe ABI break. Right now, loading a module *copies* it into memory and does not hold a reference to the underlying fs. With the patch applied, all kinds of use cases can break in gnarly ways. Initramfs is maybe okay, but initrd may be screwed. If you load an ET_EXEC module from initrd, then umount it, then clear the ramdisk, something will go horribly wrong. Exactly what goes wrong depends on whether userspace notices that umount() failed. Similarly, if you load one of these modules over a network and then lose your connection, you have a problem.
>
> there is not abi breakage and file cannot disappear from running task.
> One cannot umount fs while file is still being used.

Sure it is. Without your patch, init_module doesn’t keep using the
file, so it’s common practice to load a module and then delete or
unmount it. With your patch, the unmount case breaks. This is likely
to break existing userspace, so, in Linux speak it’s an ABI break.

>
>>
>> The “read twice” thing is also bad for another reason: containers. Suppose I have a setup where a container can load a signed module blob. With the read twice code, the container can race and run an entirely different blob outside the container.
>
> Not only "read twice", but "read many".
> If .text sections of elf that are not yet in memory can be modified
> by malicious user, later they will be brought in with different code.
> I think the easiest fix to tighten this "umh modules" to CAP_SYS_ADMIN.

Given this issue, I think the patch would need Kees’s explicit ack. I
had initially thought your patch had minimal security impact, but I
was wrong Module security is very complicated and needs to satisfy a
bunch of requirements. There is a lot of code in the kernel that
assumes that it’s sufficient to verify a module once at load time,
your patch changes that, and this has all kinds of nasty interactions
with autoloading. Kees is very reasonable, and he’ll change his mind
and ack a patch that he’s nacked when presented with a valid technical
argument.

But I think my ABI break observation is also a major problem, and
Linus is going to be pissed if this thing lands in his tree and breaks
systems due to an issue that was raised during review. So I think you
need to either rework the patch or do a serious survey of how all the
distros deal with modules (dracut, initramfs-tools, all the older
stuff, and probably more) and make sure they can all handle your
patch. I'd also be concerned about anyone who puts /lib/modules on
less reliable storage than they use for the rest of their system. (I
know it's quite common to have /boot be the only non-RAID partition on
a system, but modules don't generally live in /boot.)

Also, I think you really ought to explain how your approach will work
with MODULES=n or convince Linus that it’s okay to start adding core
networking features that don’t work with MODULES=n and can't be built
into the main kernel image.

2018-03-09 15:42:08

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On 3/9/18 7:16 AM, Andy Lutomirski wrote:
>>> On Mar 8, 2018, at 9:08 PM, Alexei Starovoitov <[email protected]> wrote:
>>>
>>> On 3/8/18 7:54 PM, Andy Lutomirski wrote:
>>>
>>>
>>>
>>>> On Mar 8, 2018, at 7:06 PM, Linus Torvalds <[email protected]> wrote:
>>>>
>>>>
>>>> Honestly, that "read twice" thing may be what scuttles this.
>>>> Initially, I thought it was a non-issue, because anybody who controls
>>>> the module subdirectory enough to rewrite files would be in a position
>>>> to just execute the file itself directly instead.
>>>
>>> On further consideration, I think there’s another showstopper. This patch is a potentially severe ABI break. Right now, loading a module *copies* it into memory and does not hold a reference to the underlying fs. With the patch applied, all kinds of use cases can break in gnarly ways. Initramfs is maybe okay, but initrd may be screwed. If you load an ET_EXEC module from initrd, then umount it, then clear the ramdisk, something will go horribly wrong. Exactly what goes wrong depends on whether userspace notices that umount() failed. Similarly, if you load one of these modules over a network and then lose your connection, you have a problem.
>>
>> there is not abi breakage and file cannot disappear from running task.
>> One cannot umount fs while file is still being used.
>
> Sure it is. Without your patch, init_module doesn’t keep using the
> file, so it’s common practice to load a module and then delete or
> unmount it. With your patch, the unmount case breaks. This is likely
> to break existing userspace, so, in Linux speak it’s an ABI break.

please read the patch again.
file is only used in case of umh modules.
There is zero difference in default case.

>>
>>>
>>> The “read twice” thing is also bad for another reason: containers. Suppose I have a setup where a container can load a signed module blob. With the read twice code, the container can race and run an entirely different blob outside the container.
>>
>> Not only "read twice", but "read many".
>> If .text sections of elf that are not yet in memory can be modified
>> by malicious user, later they will be brought in with different code.
>> I think the easiest fix to tighten this "umh modules" to CAP_SYS_ADMIN.
>
> Given this issue, I think the patch would need Kees’s explicit ack. I
> had initially thought your patch had minimal security impact, but I
> was wrong Module security is very complicated and needs to satisfy a
> bunch of requirements. There is a lot of code in the kernel that
> assumes that it’s sufficient to verify a module once at load time,
> your patch changes that, and this has all kinds of nasty interactions
> with autoloading.

not true. you misread the patch and making incorrect conclusions.

> Kees is very reasonable, and he’ll change his mind
> and ack a patch that he’s nacked when presented with a valid technical
> argument.
>
> But I think my ABI break observation is also a major problem, and
> Linus is going to be pissed if this thing lands in his tree and breaks
> systems due to an issue that was raised during review. So I think you
> need to either rework the patch or do a serious survey of how all the

I think you need to stop overreacting on non-issue.

> distros deal with modules (dracut, initramfs-tools, all the older
> stuff, and probably more) and make sure they can all handle your
> patch. I'd also be concerned about anyone who puts /lib/modules on
> less reliable storage than they use for the rest of their system. (I
> know it's quite common to have /boot be the only non-RAID partition on
> a system, but modules don't generally live in /boot.)
>
> Also, I think you really ought to explain how your approach will work
> with MODULES=n or convince Linus that it’s okay to start adding core
> networking features that don’t work with MODULES=n and can't be built
> into the main kernel image.

that ship sailed long ago.
config BPF_JIT
bool "enable BPF Just In Time compiler"
depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT
depends on MODULES



2018-03-09 16:26:11

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 9, 2018 at 3:39 PM, Alexei Starovoitov <[email protected]> wrote:
> On 3/9/18 7:16 AM, Andy Lutomirski wrote:
>>>>
>>>> On Mar 8, 2018, at 9:08 PM, Alexei Starovoitov <[email protected]> wrote:
>>>>
>>>> On 3/8/18 7:54 PM, Andy Lutomirski wrote:
>>>>
>>>>
>>>>
>>>>> On Mar 8, 2018, at 7:06 PM, Linus Torvalds
>>>>> <[email protected]> wrote:
>>>>>
>>>>>
>>>>> Honestly, that "read twice" thing may be what scuttles this.
>>>>> Initially, I thought it was a non-issue, because anybody who controls
>>>>> the module subdirectory enough to rewrite files would be in a position
>>>>> to just execute the file itself directly instead.
>>>>
>>>>
>>>> On further consideration, I think there’s another showstopper. This
>>>> patch is a potentially severe ABI break. Right now, loading a module
>>>> *copies* it into memory and does not hold a reference to the underlying fs.
>>>> With the patch applied, all kinds of use cases can break in gnarly ways.
>>>> Initramfs is maybe okay, but initrd may be screwed. If you load an ET_EXEC
>>>> module from initrd, then umount it, then clear the ramdisk, something will
>>>> go horribly wrong. Exactly what goes wrong depends on whether userspace
>>>> notices that umount() failed. Similarly, if you load one of these modules
>>>> over a network and then lose your connection, you have a problem.
>>>
>>>
>>> there is not abi breakage and file cannot disappear from running task.
>>> One cannot umount fs while file is still being used.
>>
>>
>> Sure it is. Without your patch, init_module doesn’t keep using the
>> file, so it’s common practice to load a module and then delete or
>> unmount it. With your patch, the unmount case breaks. This is likely
>> to break existing userspace, so, in Linux speak it’s an ABI break.
>
>
> please read the patch again.
> file is only used in case of umh modules.
> There is zero difference in default case.

Say I'm running some distro or other working Linux setup. I upgrade
my kernel to a kernel that uses umh modules. The user tooling
generates some kind of boot entry that references the new kernel
image, and it also generates a list of modules to be loaded at various
times in the boot process. This list might, and probably should,
include one or more umh modules. (You are being very careful to make
sure that depmod keeps working, so umh modules are clearly intended to
work with existing tooling.) So now I have a kernel image and some
modules to be loaded from various places. And I have an init script
(initramfs's '/init' or similar) that will call init_module() on that
.ko file. That script was certainly written under the assumption
that, once init_module() returns, the kernel is done with the .ko
file. With your patch applied, that assumption is no longer true.

RHEL 5 uses initrd and is still supported. For all I know, some
embedded setups put their initial /lib on some block device that
literally disappears after they finish booting. There are livecds
that let you boot in a mode that lets you remove the CD after you
finish booting. Heck, someone must have built something that calls
init_module() on a FUSE filesystem.

Heck, on my laptop, all my .ko files are labeled
system_u:object_r:modules_object_t:s0. I wonder how many SELinux
setups (and AppArmor, etc) will actually disallow execve() on modules?

>
>>>
>>>>
>>>> The “read twice” thing is also bad for another reason: containers.
>>>> Suppose I have a setup where a container can load a signed module blob. With
>>>> the read twice code, the container can race and run an entirely different
>>>> blob outside the container.
>>>
>>>
>>> Not only "read twice", but "read many".
>>> If .text sections of elf that are not yet in memory can be modified
>>> by malicious user, later they will be brought in with different code.
>>> I think the easiest fix to tighten this "umh modules" to CAP_SYS_ADMIN.
>>
>>
>> Given this issue, I think the patch would need Kees’s explicit ack. I
>> had initially thought your patch had minimal security impact, but I
>> was wrong Module security is very complicated and needs to satisfy a
>> bunch of requirements. There is a lot of code in the kernel that
>> assumes that it’s sufficient to verify a module once at load time,
>> your patch changes that, and this has all kinds of nasty interactions
>> with autoloading.
>
>
> not true. you misread the patch and making incorrect conclusions.

So please tell my exactly how I misread the patch and why Linus'
conclusion (which is what I'm echoing) is wrong.

>
> I think you need to stop overreacting on non-issue.
>

Can you please try to have a constructive discussion here? It's not
so enjoyable to review patches when author declares review comments to
be non-issues without actually explaining *why* they're non-issues.

Alexei, I'm willing to be shown that I'm wrong, but you have to show
me how I'm wrong rather than just telling me over and over that I'm
wrong.

2018-03-09 17:34:29

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On 3/9/18 8:24 AM, Andy Lutomirski wrote:
> On Fri, Mar 9, 2018 at 3:39 PM, Alexei Starovoitov <[email protected]> wrote:
>> On 3/9/18 7:16 AM, Andy Lutomirski wrote:
>>>>>
>>>>> On Mar 8, 2018, at 9:08 PM, Alexei Starovoitov <[email protected]> wrote:
>>>>>
>>>>> On 3/8/18 7:54 PM, Andy Lutomirski wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On Mar 8, 2018, at 7:06 PM, Linus Torvalds
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> Honestly, that "read twice" thing may be what scuttles this.
>>>>>> Initially, I thought it was a non-issue, because anybody who controls
>>>>>> the module subdirectory enough to rewrite files would be in a position
>>>>>> to just execute the file itself directly instead.
>>>>>
>>>>>
>>>>> On further consideration, I think there’s another showstopper. This
>>>>> patch is a potentially severe ABI break. Right now, loading a module
>>>>> *copies* it into memory and does not hold a reference to the underlying fs.
>>>>> With the patch applied, all kinds of use cases can break in gnarly ways.
>>>>> Initramfs is maybe okay, but initrd may be screwed. If you load an ET_EXEC
>>>>> module from initrd, then umount it, then clear the ramdisk, something will
>>>>> go horribly wrong. Exactly what goes wrong depends on whether userspace
>>>>> notices that umount() failed. Similarly, if you load one of these modules
>>>>> over a network and then lose your connection, you have a problem.
>>>>
>>>>
>>>> there is not abi breakage and file cannot disappear from running task.
>>>> One cannot umount fs while file is still being used.
>>>
>>>
>>> Sure it is. Without your patch, init_module doesn’t keep using the
>>> file, so it’s common practice to load a module and then delete or
>>> unmount it. With your patch, the unmount case breaks. This is likely
>>> to break existing userspace, so, in Linux speak it’s an ABI break.
>>
>>
>> please read the patch again.
>> file is only used in case of umh modules.
>> There is zero difference in default case.
>
> Say I'm running some distro or other working Linux setup. I upgrade
> my kernel to a kernel that uses umh modules. The user tooling
> generates some kind of boot entry that references the new kernel
> image, and it also generates a list of modules to be loaded at various
> times in the boot process. This list might, and probably should,
> include one or more umh modules. (You are being very careful to make
> sure that depmod keeps working, so umh modules are clearly intended to
> work with existing tooling.) So now I have a kernel image and some
> modules to be loaded from various places. And I have an init script
> (initramfs's '/init' or similar) that will call init_module() on that
> .ko file. That script was certainly written under the assumption
> that, once init_module() returns, the kernel is done with the .ko
> file. With your patch applied, that assumption is no longer true.

There is no intent to use umh modules during boot process.
This is not a replacement for drivers and kernel modules.
From your earlier comments regarding usb driver as umh module
I suspect you're assuming that everything will sooner or later
will convert to umh model.
There is no such intent. umh approach is targeting one specific
use case of converting one stable uapi into another stable uapi.
It's all control plane that can be a slow as it needs to be.
Critical kernel datapath is not going to be affected
(especially the one needed to boot)
because umh is a user mode app running async with the rest of kernel.

With patch applied there are still zero users of it.
bpfilter and nft2bpf are the only two that are going to use
this interface. Every other potential user will be code reviewed
just like everything else in the kernel land.
So your statement that with patch applied there is an ABI breakage
is just false.

At the same time I agree that keeping fs pinned while umh module
started from that fs is not great, so I intend to solve it somehow
in v2 while keeping the approach being elf based for
debuggability reasons explained earlier.

> Heck, on my laptop, all my .ko files are labeled
> system_u:object_r:modules_object_t:s0. I wonder how many SELinux
> setups (and AppArmor, etc) will actually disallow execve() on modules?

I don't think it's a good idea to move lsm into umh.

> Can you please try to have a constructive discussion here?

I'd like to ask the same favor.
Claiming ABI breakage when there is none is not constructive.
Saying that "ohh there must be a security issue here, because it looks
complex" is not constructive either.


2018-03-09 18:16:44

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 09, 2018 at 09:32:36AM -0800, Alexei Starovoitov wrote:
> On 3/9/18 8:24 AM, Andy Lutomirski wrote:
> > On Fri, Mar 9, 2018 at 3:39 PM, Alexei Starovoitov <[email protected]> wrote:
> > > On 3/9/18 7:16 AM, Andy Lutomirski wrote:
> > > > > >
> > > > > > On Mar 8, 2018, at 9:08 PM, Alexei Starovoitov <[email protected]> wrote:
> > > > > >
> > > > > > On 3/8/18 7:54 PM, Andy Lutomirski wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Mar 8, 2018, at 7:06 PM, Linus Torvalds
> > > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > >
> > > > > > > Honestly, that "read twice" thing may be what scuttles this.
> > > > > > > Initially, I thought it was a non-issue, because anybody who controls
> > > > > > > the module subdirectory enough to rewrite files would be in a position
> > > > > > > to just execute the file itself directly instead.
> > > > > >
> > > > > >
> > > > > > On further consideration, I think there’s another showstopper. This
> > > > > > patch is a potentially severe ABI break. Right now, loading a module
> > > > > > *copies* it into memory and does not hold a reference to the underlying fs.
> > > > > > With the patch applied, all kinds of use cases can break in gnarly ways.
> > > > > > Initramfs is maybe okay, but initrd may be screwed. If you load an ET_EXEC
> > > > > > module from initrd, then umount it, then clear the ramdisk, something will
> > > > > > go horribly wrong. Exactly what goes wrong depends on whether userspace
> > > > > > notices that umount() failed. Similarly, if you load one of these modules
> > > > > > over a network and then lose your connection, you have a problem.
> > > > >
> > > > >
> > > > > there is not abi breakage and file cannot disappear from running task.
> > > > > One cannot umount fs while file is still being used.
> > > >
> > > >
> > > > Sure it is. Without your patch, init_module doesn’t keep using the
> > > > file, so it’s common practice to load a module and then delete or
> > > > unmount it. With your patch, the unmount case breaks. This is likely
> > > > to break existing userspace, so, in Linux speak it’s an ABI break.
> > >
> > >
> > > please read the patch again.
> > > file is only used in case of umh modules.
> > > There is zero difference in default case.
> >
> > Say I'm running some distro or other working Linux setup. I upgrade
> > my kernel to a kernel that uses umh modules. The user tooling
> > generates some kind of boot entry that references the new kernel
> > image, and it also generates a list of modules to be loaded at various
> > times in the boot process. This list might, and probably should,
> > include one or more umh modules. (You are being very careful to make
> > sure that depmod keeps working, so umh modules are clearly intended to
> > work with existing tooling.) So now I have a kernel image and some
> > modules to be loaded from various places. And I have an init script
> > (initramfs's '/init' or similar) that will call init_module() on that
> > .ko file. That script was certainly written under the assumption
> > that, once init_module() returns, the kernel is done with the .ko
> > file. With your patch applied, that assumption is no longer true.
>
> There is no intent to use umh modules during boot process.

For _your_ use case, yes. For mine and Andy's and someone else's in the
future, it might be :)

You are creating a very generic, new, user/kernel api that a whole bunch
of people are going to want to use. Let's not hamper the ability for us
all to use this right from the beginning please.

> This is not a replacement for drivers and kernel modules.
> From your earlier comments regarding usb driver as umh module
> I suspect you're assuming that everything will sooner or later
> will convert to umh model.

We have userspace drivers for USB today, being able to drag that
out-of-tree codebase into the kernel is a _HUGE_ bonus, and something
that I would love to do for a lot of reasons. I also can see moving
some of our existing in-kernel drivers out of the kernel in a way that
provides "it just works" functionality by using this type of feature.

So again, please don't prevent us from using this, otherwise you should
be just calling this "bpf_usermode_helper" :)

Oh, and for the record, I like Andy's proposal as well as dumping this
into a kernel module "blob" with the exception that this now would take
up unswapable memory, which isn't the nicest and is one big reason we
removed the in-kernel-memory firmware blobs many years ago.

thanks,

greg k-h

2018-03-09 18:19:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Thu, Mar 8, 2018 at 9:08 PM, Alexei Starovoitov <[email protected]> wrote:
>
> there is not abi breakage and file cannot disappear from running task.
> One cannot umount fs while file is still being used.

I think that "cannot umount" part _is_ the ABI breakage that Andy is
talking about.

> Not only "read twice", but "read many".
> If .text sections of elf that are not yet in memory can be modified
> by malicious user, later they will be brought in with different code.
> I think the easiest fix to tighten this "umh modules" to CAP_SYS_ADMIN.

I don't think it actually fixes anything. It might just break things.
For all we know, people run modprobe with CAP_SYS_MODULE only, since
that is obviously the only capability it needs.

Hmm. I wish we had an "execute blob" model, but we really don't, and
it would be hard/impossible to do without pinning the pages in memory.

My gut feel is that the right direction to explore is:

- consider the module loaded for the whole duration of the execve. So
the execution is a *blocking* operation (and we get the correct
exclusion semantics)

- use deny_write_access() to make sure that we don't have active
writers and cannot get them during the execve.

The above mean that something that executes to load a new ebpf rule
will work very well. But a "start and forget" will not work (although
you can obviously do so with a internal fork/exec).

Hmm?

Linus

2018-03-09 18:25:15

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries



> On Mar 9, 2018, at 10:15 AM, Greg KH <[email protected]> wrote:
>

>
> Oh, and for the record, I like Andy's proposal as well as dumping this
> into a kernel module "blob" with the exception that this now would take
> up unswapable memory, which isn't the nicest and is one big reason we
> removed the in-kernel-memory firmware blobs many years ago.
>

It might not be totally crazy to back it by tmpfs.

2018-03-09 18:31:14

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 09, 2018 at 10:23:27AM -0800, Andy Lutomirski wrote:
>
>
> > On Mar 9, 2018, at 10:15 AM, Greg KH <[email protected]> wrote:
> >
>
> >
> > Oh, and for the record, I like Andy's proposal as well as dumping this
> > into a kernel module "blob" with the exception that this now would take
> > up unswapable memory, which isn't the nicest and is one big reason we
> > removed the in-kernel-memory firmware blobs many years ago.
> >
>
> It might not be totally crazy to back it by tmpfs.

Ah, yeah, tricky, but yes, I'd be fine with that.

Micro-kernel here we come! :)

2018-03-09 18:36:28

by David Miller

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

From: Linus Torvalds <[email protected]>
Date: Fri, 9 Mar 2018 10:17:42 -0800

> - use deny_write_access() to make sure that we don't have active
> writers and cannot get them during the execve.

I agree that this is necessary for image validation purposes.

2018-03-09 18:44:46

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 9, 2018 at 10:35 AM, David Miller <[email protected]> wrote:
> From: Linus Torvalds <[email protected]>
> Date: Fri, 9 Mar 2018 10:17:42 -0800
>
>> - use deny_write_access() to make sure that we don't have active
>> writers and cannot get them during the execve.
>
> I agree that this is necessary for image validation purposes.

Module loading (via kernel_read_file()) already uses
deny_write_access(), and so does do_open_execat(). As long as module
loading doesn't call allow_write_access() before the execve() has
started in the new implementation, I think we'd be covered here.

-Kees

--
Kees Cook
Pixel Security

2018-03-09 18:50:16

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries



> On Mar 9, 2018, at 10:17 AM, Linus Torvalds <[email protected]> wrote:
>

>
> Hmm. I wish we had an "execute blob" model, but we really don't, and
> it would be hard/impossible to do without pinning the pages in memory.
>

Why so hard? We can already execute a struct file for execveat, and Alexei already has this working for umh. Surely we can make an immutable (as in even root can’t write it) kernel-internal tmpfs file, execveat it, then unlink it. And /proc/PID/exe should be openable and readable. The blob itself would be __initdata so it gets discarded after it lands in tmpfs.

2018-03-09 18:52:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 9, 2018 at 10:43 AM, Kees Cook <[email protected]> wrote:
>
> Module loading (via kernel_read_file()) already uses
> deny_write_access(), and so does do_open_execat(). As long as module
> loading doesn't call allow_write_access() before the execve() has
> started in the new implementation, I think we'd be covered here.

No. kernel_read_file() only does it *during* the read.

So there's a huge big honking gap between the two.

Also, the second part of my suggestion was to be entirely synchronous
with the whole execution of the process, and do it within the "we do
mutual exclusion fo rmodules with the same name" logic.

Note that Andrei's patch uses UMH_WAIT_EXEC. That's basically
"vfork+exec" - it only waits for the exec to have started, it doesn't
wait for the whole thing.

So I'm saying "use UMH_WAIT_PROC, do it in a different place, and make
sure you cover the whole sequence with deny_write_access()".

Linus

2018-03-09 18:53:41

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On 3/9/18 10:23 AM, Andy Lutomirski wrote:
>
>
>> On Mar 9, 2018, at 10:15 AM, Greg KH <[email protected]> wrote:
>>
>
>>
>> Oh, and for the record, I like Andy's proposal as well as dumping this
>> into a kernel module "blob" with the exception that this now would take
>> up unswapable memory, which isn't the nicest and is one big reason we
>> removed the in-kernel-memory firmware blobs many years ago.
>>
>
> It might not be totally crazy to back it by tmpfs.

interesting. how do you propose to do it?
Something like:
- create /umh_module_tempxxx dir
- mount tmpfs there
- copy elf into it and exec it?


2018-03-09 18:54:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 9, 2018 at 10:48 AM, Andy Lutomirski <[email protected]> wrote:
>> On Mar 9, 2018, at 10:17 AM, Linus Torvalds <[email protected]> wrote:
>>
>> Hmm. I wish we had an "execute blob" model, but we really don't, and
>> it would be hard/impossible to do without pinning the pages in memory.
>>
>
> Why so hard? We can already execute a struct file for execveat, and Alexei already has this working for umh.
> Surely we can make an immutable (as in even root can’t write it) kernel-internal tmpfs file, execveat it, then unlink it.

And what do you think that does? It pins the memory for the whole
time. As a *copy* of the original file.

Anyway, see my other suggestion that makes this all irrelevant. Just
wait synchronously (until the exit), and just use deny_write_access().

The "synchronous wait" means that you don't have the semantic change
(and really., it's *required* anyway for the whole mutual exclusion
against another thread racing to load the same module), and the
deny_write_access() means that we don't neeed to make another copy.

Linus

2018-03-09 18:56:08

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 9, 2018 at 10:50 AM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Mar 9, 2018 at 10:43 AM, Kees Cook <[email protected]> wrote:
>>
>> Module loading (via kernel_read_file()) already uses
>> deny_write_access(), and so does do_open_execat(). As long as module
>> loading doesn't call allow_write_access() before the execve() has
>> started in the new implementation, I think we'd be covered here.
>
> No. kernel_read_file() only does it *during* the read.

Ah, true. And looking at this again, shouldn't deny_write_access()
happen _before_ the LSM check in kernel_read_file()? That looks like a
problem...

-Kees

--
Kees Cook
Pixel Security

2018-03-09 18:57:09

by David Miller

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

From: Alexei Starovoitov <[email protected]>
Date: Fri, 9 Mar 2018 10:50:49 -0800

> On 3/9/18 10:23 AM, Andy Lutomirski wrote:
>> It might not be totally crazy to back it by tmpfs.
>
> interesting. how do you propose to do it?
> Something like:
> - create /umh_module_tempxxx dir
> - mount tmpfs there
> - copy elf into it and exec it?

I think the idea is that it's an internal tmpfs mount that only
the kernel has access too.

And I don't think that even hurts your debuggability concerns. The
user can just attach using the foo.ko file in the actual filesystem.


2018-03-09 18:58:36

by David Miller

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

From: Linus Torvalds <[email protected]>
Date: Fri, 9 Mar 2018 10:53:45 -0800

> Anyway, see my other suggestion that makes this all irrelevant. Just
> wait synchronously (until the exit), and just use deny_write_access().

What exit?

Once the helper UMH is invoked, it runs asynchronously taking eBPF
translation requests.

2018-03-09 19:00:50

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On 3/9/18 10:50 AM, Linus Torvalds wrote:
> On Fri, Mar 9, 2018 at 10:43 AM, Kees Cook <[email protected]> wrote:
>>
>> Module loading (via kernel_read_file()) already uses
>> deny_write_access(), and so does do_open_execat(). As long as module
>> loading doesn't call allow_write_access() before the execve() has
>> started in the new implementation, I think we'd be covered here.
>
> No. kernel_read_file() only does it *during* the read.
>
> So there's a huge big honking gap between the two.
>
> Also, the second part of my suggestion was to be entirely synchronous
> with the whole execution of the process, and do it within the "we do
> mutual exclusion fo rmodules with the same name" logic.
>
> Note that Andrei's patch uses UMH_WAIT_EXEC. That's basically
> "vfork+exec" - it only waits for the exec to have started, it doesn't
> wait for the whole thing.

It's not waiting for the whole thing, because once bpfilter starts it
stays running/sleeping because it's stateful. It needs normal
malloc-ed memory to keep the state of iptable->bpf translation that
it will use later during subsequent translation calls.
Theoretically it can use bpf maps pinned in kernel memory to keep
this state, but then it's non-swappable. It's better to keep bpfilter
state in its own user memory.


2018-03-09 19:13:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 9, 2018 at 10:57 AM, David Miller <[email protected]> wrote:
>
> Once the helper UMH is invoked, it runs asynchronously taking eBPF
> translation requests.

How?

Really. See my comment about mutual exclusion. The current patch is
*broken* because it doesn't handle it. Really.

Think of it this way - you may have now started *five* of those things
concurrently by mistake.

The actual module loading case never does that, because the actual
module loading case has per-module serialization that got
short-circuited.

How are you going to handle five processes doing the same setup concurrently?

Linus

2018-03-09 19:39:27

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 9, 2018 at 6:55 PM, David Miller <[email protected]> wrote:
> From: Alexei Starovoitov <[email protected]>
> Date: Fri, 9 Mar 2018 10:50:49 -0800
>
>> On 3/9/18 10:23 AM, Andy Lutomirski wrote:
>>> It might not be totally crazy to back it by tmpfs.
>>
>> interesting. how do you propose to do it?
>> Something like:
>> - create /umh_module_tempxxx dir
>> - mount tmpfs there
>> - copy elf into it and exec it?
>
> I think the idea is that it's an internal tmpfs mount that only
> the kernel has access too.

That's what I was imagining. There's precedent. For example, there's
a very short piece of code that does it in
drivers/gpu/drm/i915/i915_gemfs.c.


>
> And I don't think that even hurts your debuggability concerns. The
> user can just attach using the foo.ko file in the actual filesystem.
>

Not if the .ko is actually a shim that actually just contains a blob
and a few lines of code to kick off the umh. But one could still
debug it using kernel debug symbols (like vDSO debugging works right
now, at least if your distro is in a good mood) or by reading the
contents from /proc/PID/exe.

2018-03-09 19:40:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 9, 2018 at 11:12 AM, Linus Torvalds
<[email protected]> wrote:
>
> How are you going to handle five processes doing the same setup concurrently?

Side note: it's not just serialization. It's also "is it actually up
and running".

The rule for "request_module()" (for a real module) has been that it
returns when the module is actually alive and active and have done
their initcalls.

The UMH_WAIT_EXEC behavior (ignore the serialization - you could do
that in the caller) behavior doesn't actually have any semantics AT
ALL. It only means that you get the error returns from execve()
itself, so you know that the executable file actually existed and
parsed right enough to get started.

But you don't actually have any reason to believe that it has *done*
anything, and started processing any requests. There's no reason
what-so-ever to believe that it has registered itself for any
asynchronous requests or anything like that.

So in the real module case, you can do

request_module("modulename");

and just start using whatever resource you just requested. So the
netfilter code literally does

request_module("nft-chain-%u-%.*s", family,
nla_len(nla), (const char *)nla_data(nla));
nfnl_lock(NFNL_SUBSYS_NFTABLES);
type = __nf_tables_chain_type_lookup(nla, family);
if (type != NULL)
return ERR_PTR(-EAGAIN);

and doesn't even care about error handling for request_module()
itself, because it knows that either the module got loaded and is
ready, or something failed. And it needs to look that chain type up
anyway, so the failure is indicated by _that_.

With a UMH_WAIT_EXEC? No. You have *nothing*. You know the thing
started, but it might have SIGSEGV'd immediately, and you have
absolutely no way of knowing, and absolutely no way of even figuring
it out. You can wait - forever - for something to bind to whatever
dynamic resource you're expecting. You'll just fundamentally never
know.

You can try again, of course. Add a timeout, and try again in five
seconds or something. Maybe it will work then. Maybe it won't. You
won't have any way to know the _second_ time around either. Or the
third. Or...

See why I say it has to be synchronous?

If it's synchronous, you can actually do things like

(a) maybe you only need a one-time thing, and don't have any state
("load fixed tables, be done") and that's it. If the process returns
with no error code, you're all done, and you know it's fine.

(b) maybe the process wants to start a listener daemon or something
like the traditional inetd model. It can open the socket, it can start
listening on it, and it can fork off a child and check it's status. It
can then do exit(0) if everything is fine, and now request_module()
returns.

see the difference? Even if you ended up with a background process
(like in that (b) case), you did so with *error* handling, and you did
so knowing that the state has actually been set up by the time the
request_module() returns.

And if you use the proper module loading exclusion, it also means that
that (b) can know it's the only process starting up, and it's not
racing with another one. It might still want to do the usual
lock-files in user space to protect against just the admin starting it
manually, but at least you don't have the situation that a hundred
threads just had a thundering herd where they all ended up using the
same kernel facility, and they all independently started a hundred
usermode helpers.

Linus

2018-03-09 19:46:44

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 9, 2018 at 7:38 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Mar 9, 2018 at 11:12 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> How are you going to handle five processes doing the same setup concurrently?
>
> Side note: it's not just serialization. It's also "is it actually up
> and running".
>

I think the right way to solve this would be to take a hint from
systemd's socket activation model. The current patch had the module
load process kick off an ELF binary that goes an registers itself to
handle something. We can turn that around. Make the module init
function create the socket (or pipe or whatever) receives request and
pass it to the user program as stdin. Then the kernel can start
queueing requests into the socket immediately, and the user program
will get to them whenever it finishes initializing. Or it can write
some message to the socket saying "hey, I'm ready".

This also completely avoids the issue where some clever user manually
loads the "module" with exec() ("hey, I'm so clever, I can just run
the damn thing instead if using init_module()!" or writes an
out-of-tree program that uses whatever supposedly secret API the
in-kernel binary is supposed to use to register itself (and I know
people who would do exactly that!) and the kernel does
request_module() at roughly the same time.

2018-03-10 01:45:58

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On 3/9/18 11:37 AM, Andy Lutomirski wrote:
> On Fri, Mar 9, 2018 at 6:55 PM, David Miller <[email protected]> wrote:
>> From: Alexei Starovoitov <[email protected]>
>> Date: Fri, 9 Mar 2018 10:50:49 -0800
>>
>>> On 3/9/18 10:23 AM, Andy Lutomirski wrote:
>>>> It might not be totally crazy to back it by tmpfs.
>>>
>>> interesting. how do you propose to do it?
>>> Something like:
>>> - create /umh_module_tempxxx dir
>>> - mount tmpfs there
>>> - copy elf into it and exec it?
>>
>> I think the idea is that it's an internal tmpfs mount that only
>> the kernel has access too.
>
> That's what I was imagining. There's precedent. For example, there's
> a very short piece of code that does it in
> drivers/gpu/drm/i915/i915_gemfs.c.

I can do "monkey see monkey do" approach which will look like:
type = get_fs_type("tmpfs");
fs = kern_mount(type);

/* for each request_umh("foo") */
file = shmem_file_setup_with_mnt(fs, "umh_foo");
do {
pagecache_write_begin(file,...);
memcpy()
pagecache_write_end();
} while (umh_elf_size);
do_execve_file(file);
fput(file);

while keeping fs mounted forever?
is there better way?


2018-03-10 02:36:11

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On 3/9/18 11:38 AM, Linus Torvalds wrote:
> On Fri, Mar 9, 2018 at 11:12 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> How are you going to handle five processes doing the same setup concurrently?
>
> Side note: it's not just serialization. It's also "is it actually up
> and running".
>
> The rule for "request_module()" (for a real module) has been that it
> returns when the module is actually alive and active and have done
> their initcalls.
>
> The UMH_WAIT_EXEC behavior (ignore the serialization - you could do
> that in the caller) behavior doesn't actually have any semantics AT
> ALL. It only means that you get the error returns from execve()
> itself, so you know that the executable file actually existed and
> parsed right enough to get started.
>
> But you don't actually have any reason to believe that it has *done*
> anything, and started processing any requests. There's no reason
> what-so-ever to believe that it has registered itself for any
> asynchronous requests or anything like that.
>
> So in the real module case, you can do
>
> request_module("modulename");
>
> and just start using whatever resource you just requested. So the
> netfilter code literally does
>
> request_module("nft-chain-%u-%.*s", family,
> nla_len(nla), (const char *)nla_data(nla));
> nfnl_lock(NFNL_SUBSYS_NFTABLES);
> type = __nf_tables_chain_type_lookup(nla, family);
> if (type != NULL)
> return ERR_PTR(-EAGAIN);
>
> and doesn't even care about error handling for request_module()
> itself, because it knows that either the module got loaded and is
> ready, or something failed. And it needs to look that chain type up
> anyway, so the failure is indicated by _that_.
>
> With a UMH_WAIT_EXEC? No. You have *nothing*. You know the thing
> started, but it might have SIGSEGV'd immediately, and you have
> absolutely no way of knowing, and absolutely no way of even figuring
> it out. You can wait - forever - for something to bind to whatever
> dynamic resource you're expecting. You'll just fundamentally never
> know.
>
> You can try again, of course. Add a timeout, and try again in five
> seconds or something. Maybe it will work then. Maybe it won't. You
> won't have any way to know the _second_ time around either. Or the
> third. Or...
>
> See why I say it has to be synchronous?
>
> If it's synchronous, you can actually do things like
>
> (a) maybe you only need a one-time thing, and don't have any state
> ("load fixed tables, be done") and that's it. If the process returns
> with no error code, you're all done, and you know it's fine.

I agree that sync approach nicely fits this use case, but waiting
for umh brings the whole set of suspend/resume issues that
Luis explained earlier and if I understood his points that stuff
is still not quite working right and he's planning a set of fixes.
I really don't want the unknown timeline of fixes for 'waiting umh'
to become a blocker for the bpfilter project ...

> (b) maybe the process wants to start a listener daemon or something
> like the traditional inetd model. It can open the socket, it can start
> listening on it, and it can fork off a child and check it's status. It
> can then do exit(0) if everything is fine, and now request_module()
> returns.

... while for bpfilter use case we need a daemon and starting umh
with UMH_WAIT_PROC which does fork() right away and parent does exit(0)
is not any different from kernel pov than UMH_WAIT_EXEC.
The kernel will be in the same situation described above. The forked
process could have sigsegv quickly (right after telling parent that
everything is ok) and kernel is waiting forever.

That situation is what I was alluding to regarding that kernel<->umh
need to have some sort of health check protocol.
Regardless of how umh is invoked.

I think what Andy proposing with kernel creating a pipe and giving
it to umh can be used as this health check and means of
'load exclusion' to make sure that only one requested umh is running.

Also I like Luis suggestion to use some other name than request_module()
Something like:
request_umh_module_sync("foo");
request_umh_module_nowait("foo");

in both cases the kernel would create a pipe, call umh either
with UMH_WAIT_PROC or UMH_WAIT_EXEC and make sure that umh
responds to first hello message.
On success they would return a handle to that pipe and umh's pid.
The further interaction between the kernel and umh will be
via that pipe. If kernel->umh request times out the kernel
side of bpfilter would sigkill the task and do request_umh() again.
If that request_umh() fails there will be no retry, since
at this point it's clear that given umh is broken.

I'll implement only _nowait() version, since that's what we need
for bpfilter and when suspend/resume issues are solved somebody
who cares about usb driver in user mode can implement the _sync()
variant.

All that on top of tmpfs->file->execve_file hack that I still
need feedback on.


2018-03-10 14:09:57

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Fri, Mar 09, 2018 at 06:34:18PM -0800, Alexei Starovoitov wrote:
> On 3/9/18 11:38 AM, Linus Torvalds wrote:
> > On Fri, Mar 9, 2018 at 11:12 AM, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > How are you going to handle five processes doing the same setup
> > > concurrently?

Let's keep in mind we don't have a formal way to specify this as well
for modules as well, other than kconfig. Ie it'd be up to the author
of the modules to pick this up and make things mutually exclusive.

> > Side note: it's not just serialization. It's also "is it actually up
> > and running".
> >
> > The rule for "request_module()" (for a real module) has been that it
> > returns when the module is actually alive and active and have done
> > their initcalls.

Unfortunately this is not accurate though, the loose grammar around this fact
is one of the reasons why long term I think either new API should be added, or
the existing request_module() API modified.

request_module() is not deterministic today, try_then_request_module() *is* but
its IMHO its a horrible grammatical solution, and I'm not a fan of the idea of
umh modules perpetuating this nonsense *long term*. Details below.

> > The UMH_WAIT_EXEC behavior (ignore the serialization - you could do
> > that in the caller) behavior doesn't actually have any semantics AT
> > ALL. It only means that you get the error returns from execve()
> > itself, so you know that the executable file actually existed and
> > parsed right enough to get started.
> >
> > But you don't actually have any reason to believe that it has *done*
> > anything, and started processing any requests. There's no reason
> > what-so-ever to believe that it has registered itself for any
> > asynchronous requests or anything like that.
>
> I agree that sync approach nicely fits this use case, but waiting
> for umh brings the whole set of suspend/resume issues that
> Luis explained earlier and if I understood his points that stuff
> is still not quite working right

It works enough that suspend works well enough on Linux today, but the primary
reason is that the big kernel/umh.c API user today is the firmware API for the
old fallback firmware interface and it has in place the sync
usermodehelper_read_trylock() and async async usermodehelper_read_lock_wait().
Note that in the fallback case there is just the uevent call.

> and he's planning a set of fixes.

The move of the umh locks out of the non-fallback case was a long term goal I
had which I was planning on doing slowly, but recently got jump started via
v4.14 via commit f007cad159e9 ("Revert "firmware: add sanity check on
shutdown/suspend"). As such that goal is now complete thanks to Linus correctly
pushing it forward.

> I really don't want the unknown timeline of fixes for 'waiting umh'
> to become a blocker for the bpfilter project ...

The reason for me to bring up the suspend stuff was that there no other
*heavy* UMH API users in the kernel today, and just to highlight that
care must be taken if we want to consider in the future further
possibly heavy UMH callers which we *can* expect to be called around
suspend and resume.

*Iff* that will be the case for these new umh userspace modules, we can
evaluate a future plan. But not that this is as vague as suggesting the
same for any further kernel future UMH API user! If the only umh module
we expect for a while will be bpfilter and it we don't expect heavy use
during suspend/resume this is a non-issue and likely won't be for a while,
and all that *should be done* is become aware of the today's limitations
as we *document* any new API, even if its the simple and easy
request_umh_module*() calls.

Today's use of the UMH API to always use helper_lock() and prevent suspend
while a call is being issued should suffice, so long as these umh modules don't
become heavy users during suspend/resume.

Note that there even *further* further advanced suspend/resume considerations
with filesystems but we have a reasonable hand on what to do there [0] and
it frankly isn't *that* much work as I have most of it done already.

The only other suspend optimization I could think of left to evaluate may be
whether or not to we should generalize something like the firmware cache for
other UMH callers but that's really a long term topic.

So I would not say there is pending work left to do, it should suffice
to document the semantics and limitations if the umh modules are to be
added. That's it.

Linus has proven me right that the *concerns* I've had for these corner
cases are just that, and I do believe documenting the limitations should
suffice for new APIs.

[0] https://lkml.kernel.org/r/[email protected]

> Also I like Luis suggestion to use some other name than request_module()
> Something like:
> request_umh_module_sync("foo");
> request_umh_module_nowait("foo");
>
> in both cases the kernel would create a pipe, call umh either
> with UMH_WAIT_PROC or UMH_WAIT_EXEC and make sure that umh
> responds to first hello message.

Its not a widely known thing (and we can correct this with just slightly better
documentation) that contrary to what you might expect, today's synchronous
request_module() call in *now way* ensures the module requested got loaded,
even if 0 is returned, this *cannot* be guaranteed.

All that request_module() returning 0 tells us is that we found a
/sbin/modprobe, and kicked it off. We don't wait for a finit_module() to
complete.

We *could* add a generic form for this, *iff* this is is desirable. I have
enough random code on my development trees not submitted upstream which does
this *if we wanted it. I posted recently what I had stored in my kernel closet
for years about modules and aliases [1] simply because I figured Djalal may be
able to take advantage of aliases in-kernel for debugging purposes, but I had
*not yet* submitted code to actually make use of this for non-debugging
purposes and provide us with an optional way to truly get us that simple
deterministic API call that ensures a module *really* is loaded if the API
caller returns 0.

Today's *way* to resolve this was for Rusty to have added long ago the
try_then_request_module() helper. That is the real proper way to load modules
(TM) today *iff* you really needed to make sure that the service they provide
is loaded and available.

The beauty about it is its stupid simplicity -- try_then_request_module()
checks and enables for *any* service to be available in an argument and
if it got loaded, it'd be present.

While Rusty was content with it and its simplicity. I'm personally *not* a fan
of the loose ends it leaves behind. I believe it allows for sloppy programmers
and I don't think we should blame kernel developers today of believing that
request_module() *does* actually load their modules and its ready. The only way
to police this is manual code introspection, you can't even add generic
Coccinelle SmPL grammar rules to pick up if anyone used
try_then_request_module() or request_module() APIs properly if the goal was to
ensure the module was loaded. This is why I believe that this is sloppy and a
long term mistake.

But we haven't gotten the will to accept to change this. If others do believe
we *need this* in consideration for future uses of userspace modules, now would
be a good time to consider this *long term*.

The alternative to this would be a simple equivalent of try_then_request_module()
for UMH modules: try_umhm_then_request_umh_module() or whatever. So just as I
argued earlier over UMH limitations, this is not the end of the world for umh
modules, and it doesn't mean you can't get *properly* add umh modules upstream,
it would *just mean* we'd be perpetuating today's (IMHO) horrible and loose
semantics.

[1] https://lkml.kernel.org/r/[email protected]

Luis

2018-03-11 02:18:58

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Sat, Mar 10, 2018 at 1:43 AM, Alexei Starovoitov <[email protected]> wrote:
> On 3/9/18 11:37 AM, Andy Lutomirski wrote:
>>
>> On Fri, Mar 9, 2018 at 6:55 PM, David Miller <[email protected]> wrote:
>>>
>>> From: Alexei Starovoitov <[email protected]>
>>> Date: Fri, 9 Mar 2018 10:50:49 -0800
>>>
>>>> On 3/9/18 10:23 AM, Andy Lutomirski wrote:
>>>>>
>>>>> It might not be totally crazy to back it by tmpfs.
>>>>
>>>>
>>>> interesting. how do you propose to do it?
>>>> Something like:
>>>> - create /umh_module_tempxxx dir
>>>> - mount tmpfs there
>>>> - copy elf into it and exec it?
>>>
>>>
>>> I think the idea is that it's an internal tmpfs mount that only
>>> the kernel has access too.
>>
>>
>> That's what I was imagining. There's precedent. For example, there's
>> a very short piece of code that does it in
>> drivers/gpu/drm/i915/i915_gemfs.c.
>
>
> I can do "monkey see monkey do" approach which will look like:
> type = get_fs_type("tmpfs");
> fs = kern_mount(type);
>
> /* for each request_umh("foo") */
> file = shmem_file_setup_with_mnt(fs, "umh_foo");
> do {
> pagecache_write_begin(file,...);
> memcpy()
> pagecache_write_end();
> } while (umh_elf_size);
> do_execve_file(file);
> fput(file);
>
> while keeping fs mounted forever?
> is there better way?
>

Nice! I'm definitely not a pagecache expert, but it looks generally
sane. Once the thing is actually functional, we can bang on it, and
I'm sure that linux-mm will have some suggestions to tidy it up.

As for the actual lifetime of the filesystem, I think it should be
mounted once and never unmounted. Whenever it gains a second user,
the whole thing can be moved to mm/ or lib/ and all the users can
share the same mount.

Minor caveat: I would arrange the code a bit differently, like this:

static (or extern) unsigned char __initdata the_blob[];
static struct file *umh_blob_file;

static int __init my_module_init_function(void)
{
/* for each request_umh("foo") */
umh_blob_file = shmem_file_setup_with_mnt(fs, "umh_foo");
do {
pagecache_write_begin(umh_file,...);
memcpy()
pagecache_write_end();
} while (umh_elf_size);

/* the_blob is implicitly freed after this returns */
}

and then actually use the struct file later on. If and when you're
sure you're not going to spawn another copy, you can fput() it. This
way the memory becomes swappable immediately on load.

As for request_module() vs request_module_umh(), my advice would be to
write the code and then see what interface makes sense. I wouldn't be
surprised if it ends up making more sense to keep all of this entirely
independent from the module system.

P.S. I suspect that, before this hits a release, someone's going to
have to fiddle with the LSM hooks in do_execve() a bit to make sure
that LSM unconditionally approves this type of umh program. Otherwise
there might be pointless failures on some more locked down
configurations. But that can wait until it's more final and the
security folks review the code.

2018-03-11 21:08:21

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

Also,

Alexei you never answered my questions out aliases with the umh modules.
Long term this important to consider.

Luis

2018-03-11 21:46:27

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Sat, Mar 10, 2018 at 02:08:43PM +0000, Luis R. Rodriguez wrote:
> The alternative to this would be a simple equivalent of try_then_request_module()
> for UMH modules: try_umhm_then_request_umh_module() or whatever. So just as I
> argued earlier over UMH limitations, this is not the end of the world for umh
> modules, and it doesn't mean you can't get *properly* add umh modules upstream,
> it would *just mean* we'd be perpetuating today's (IMHO) horrible and loose
> semantics.

I was about to suggest that perhaps a try_umhm_then_request_umh_module() or
whatever should not be a macro -- but instead an actual routine, and we don't
export say the simple form to avoid non-deterministic uses of it from the
start... but the thing is *it'd have to be a macro* given that the *check* for
the module *has to be loose*, just as try_then_request_module()...

*Ugh* gross.

Another reason for me to want an actual deterministic clean proper solution
from the start.

Luis

2018-03-12 12:03:38

by Edward Cree

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On 09/03/18 18:58, Alexei Starovoitov wrote:
> It's not waiting for the whole thing, because once bpfilter starts it
> stays running/sleeping because it's stateful.
So, this has been bugging me a bit.
If bpfilter takes a signal and crashes, all that state goes away.
Does that mean your iptables/netfilter config just got forgotten and next
 time you run iptables it disappears, so you have to re-apply it all again?
> It needs normal
> malloc-ed memory to keep the state of iptable->bpf translation that
> it will use later during subsequent translation calls.
> Theoretically it can use bpf maps pinned in kernel memory to keep
> this state, but then it's non-swappable. It's better to keep bpfilter
> state in its own user memory.
Perhaps the state should live in swappable kernel memory (e.g. a tmpfs
 thing, which bpfilter could access through a mount).  It'd be read-only
 to userspace, listing the existing rules (in untranslated form), and be
 updated to reflect the new rule after bpfilter has supplied the updated
 translation.
Then bpfilter can cache things if it wants, but the kernel remains the
 ultimate arbiter of the state and maintains it over a bpfilter crash.

Sound reasonable?

-Ed

2018-03-12 17:25:41

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On 3/10/18 7:34 AM, Luis R. Rodriguez wrote:
> Also,
>
> Alexei you never answered my questions out aliases with the umh modules.
> Long term this important to consider.

aliases always felt like a crutch to me.
I can see an argument when they're used as 'alias pci:* foo'
but the way it's used in networking with ip_set_* and nf-* is
something I prefer not to ever do again.
Definitely no aliases for bpfilter umh.


2018-03-12 17:50:55

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On 3/12/18 5:02 AM, Edward Cree wrote:
> On 09/03/18 18:58, Alexei Starovoitov wrote:
>> It's not waiting for the whole thing, because once bpfilter starts it
>> stays running/sleeping because it's stateful.
> So, this has been bugging me a bit.
> If bpfilter takes a signal and crashes, all that state goes away.
> Does that mean your iptables/netfilter config just got forgotten and next
> time you run iptables it disappears, so you have to re-apply it all again?
>> It needs normal
>> malloc-ed memory to keep the state of iptable->bpf translation that
>> it will use later during subsequent translation calls.
>> Theoretically it can use bpf maps pinned in kernel memory to keep
>> this state, but then it's non-swappable. It's better to keep bpfilter
>> state in its own user memory.
> Perhaps the state should live in swappable kernel memory (e.g. a tmpfs
> thing, which bpfilter could access through a mount). It'd be read-only
> to userspace, listing the existing rules (in untranslated form), and be
> updated to reflect the new rule after bpfilter has supplied the updated
> translation.
> Then bpfilter can cache things if it wants, but the kernel remains the
> ultimate arbiter of the state and maintains it over a bpfilter crash.

seems like overkill.
I consider crashing bpfilter same severity as kernel bug.
Whatever firewall rules already installed will continue to work,
but new ones won't be able to load and current set cannot be queried.
Control plane crashed, dataplane continues to work.
Still a ton better than whole system crash.
We have plenty of work ahead of us without worrying about restarting
that umh and reloading its state from tmpfs.
Something to consider for later phases of the project.


2018-03-13 08:51:41

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Mon, Mar 12, 2018 at 10:22:00AM -0700, Alexei Starovoitov wrote:
> On 3/10/18 7:34 AM, Luis R. Rodriguez wrote:
> > Also,
> >
> > Alexei you never answered my questions out aliases with the umh modules.
> > Long term this important to consider.
>
> aliases always felt like a crutch to me.
> I can see an argument when they're used as 'alias pci:* foo'
> but the way it's used in networking with ip_set_* and nf-* is
> something I prefer not to ever do again.
> Definitely no aliases for bpfilter umh.

I agree, let's not do that if at all possible for these types of
binaries.

greg k-h

2018-03-22 20:55:33

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Sat, Mar 10, 2018 at 03:16:52PM +0000, Luis R. Rodriguez wrote:
> On Sat, Mar 10, 2018 at 02:08:43PM +0000, Luis R. Rodriguez wrote:
> > The alternative to this would be a simple equivalent of try_then_request_module()
> > for UMH modules: try_umhm_then_request_umh_module() or whatever. So just as I
> > argued earlier over UMH limitations, this is not the end of the world for umh
> > modules, and it doesn't mean you can't get *properly* add umh modules upstream,
> > it would *just mean* we'd be perpetuating today's (IMHO) horrible and loose
> > semantics.
>
> I was about to suggest that perhaps a try_umhm_then_request_umh_module() or
> whatever should not be a macro -- but instead an actual routine, and we don't
> export say the simple form to avoid non-deterministic uses of it from the
> start... but the thing is *it'd have to be a macro* given that the *check* for
> the module *has to be loose*, just as try_then_request_module()...
>
> *Ugh* gross.
>
> Another reason for me to want an actual deterministic clean proper solution
> from the start.

I just thought of another consideration which should be made here for the long
term.

Some init systems have a timeout for kmod workers, that is the userspace
process which issues the modprobe call.

That was very well intentioned, however it ended up being nonsense, so at least
on SLE systemd we disable the timeout for kmod workers. What others do... is
unclear to me. Upstream wise the timeout was increased considerably, however,
*if* such timeout is in effect for users it has some implicit implications on
the number of devices a driver could support:

number_devices = systemd_timeout
-------------------------------------
max known probe time for driver

I've documented the logic to these conclusions [0].

It sounds like we *do* want a full sync wait mechanism, and as I noted I think
we should fix the determinism aspect of it. Since no aliases will be supported
for usermode modules this will be much easier to support, and I can volunteer
to help with that.

However given the above... if we're going to use request_module() API (or a
really fixed deterministic version of it later) for usermode kernel modules,
the limitation above still applies.

Are these usermode modules doing all the handy work on init? Or can it be
deferred once loaded? How much loading on init should a usermode module need?

If we can ensure that these usermode modules don't take *any time at all* on
their init *from the start*, it would be wonderful and we'd end up avoiding
some really odd corner case issues later.

[0] http://www.do-not-panic.com/2015/12/linux-asynchronous-probe.html

Luis

2018-03-22 22:17:14

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Thu, Mar 22, 2018 at 8:54 PM, Luis R. Rodriguez <[email protected]> wrote:
> If we can ensure that these usermode modules don't take *any time at all* on
> their init *from the start*, it would be wonderful and we'd end up avoiding
> some really odd corner case issues later.
>

I don't see why this issue needs to exist at all for the new stuff.
After all, the new things aren't usermode modules per se. They're
regular kernel code (modular or otherwise) that loads a usermode
helper. All we need to do is to make sure that, if this is
distributed as a module, that it's init routine doesn't wait for a
long time, right?

2018-03-22 22:23:30

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On 3/22/18 3:15 PM, Andy Lutomirski wrote:
> On Thu, Mar 22, 2018 at 8:54 PM, Luis R. Rodriguez <[email protected]> wrote:
>> If we can ensure that these usermode modules don't take *any time at all* on
>> their init *from the start*, it would be wonderful and we'd end up avoiding
>> some really odd corner case issues later.
>>
>
> I don't see why this issue needs to exist at all for the new stuff.
> After all, the new things aren't usermode modules per se. They're
> regular kernel code (modular or otherwise) that loads a usermode
> helper. All we need to do is to make sure that, if this is
> distributed as a module, that it's init routine doesn't wait for a
> long time, right?

I've implemented all of the previous suggestions and
now there are zero changes to kernel/module.c
I still need to finish tracpoint stuff first and polish umh code a bit
before sending new version.
Let's hold on this thread until then.


2018-03-23 02:49:35

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

On Thu, Mar 22, 2018 at 3:15 PM, Andy Lutomirski <[email protected]> wrote:
> All we need to do is to make sure that, if this is
> distributed as a module, that it's init routine doesn't wait for a
> long time, right?

Yeap.

Luis

2018-05-02 09:13:06

by Jesper Dangaard Brouer

[permalink] [raw]
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries


On Tue, 6 Mar 2018 15:42:41 -0800 Chris Mason <[email protected]> wrote:
> On 6 Mar 2018, at 11:12, Linus Torvalds wrote:
>
[...]
> >
> > I do *not* want this to be a magical way to hide things.
>
> Especially early on, this makes a lot of sense. But I wanted to plug
> bps and the hopefully growing set of bpf introspection tools:
>
> https://github.com/iovisor/bcc/blob/master/introspection/bps_example.txt
>
> Long term these are probably a good place to tell the admin what's going
> on.
(related to bpf itself not modprobe subject)

Hi Chris,

I just want to point out that the tool 'bpftool', is currently the
dominating tool for eBPF introspection. And the 'bps' tool you mention
seems to have gained less (open source) traction.

The bpftool is part of the kernel git-tree: tools/bpf/bpftool
https://github.com/torvalds/linux/blob/master/tools/bpf/bpftool/

And it even have bash-completion and man-pages in RST format so they
even render nicely when viewed via github:

https://github.com/torvalds/linux/blob/master/tools/bpf/bpftool/Documentation/bpftool.rst
https://github.com/torvalds/linux/blob/master/tools/bpf/bpftool/Documentation/bpftool-prog.rst
https://github.com/torvalds/linux/blob/master/tools/bpf/bpftool/Documentation/bpftool-map.rst

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer