Hi,
I would like to propose a new syscall that exposes the functionality of
request_module() to userspace.
Proposed signature: request_module(char *module_name, char **args, int flags);
Where args and flags have to be NULL and 0 for the time being.
Rationale:
We are using nested, privileged containers which are loading kernel modules.
Currently we have to always pass around the contents of /lib/modules from the
root namespace which contains the modules.
(Also, the containers need to have the userspace components for module loading
installed.)
The syscall would remove the need for this bookkeeping work.
If this has a chance of getting accepted I would be happy to provide an
implementation.
Thanks,
Thomas
On Wed, Sep 15, 2021 at 05:49:34PM +0200, Thomas Weißschuh wrote:
> Hi,
>
> I would like to propose a new syscall that exposes the functionality of
> request_module() to userspace.
>
> Proposed signature: request_module(char *module_name, char **args, int flags);
> Where args and flags have to be NULL and 0 for the time being.
>
> Rationale:
>
> We are using nested, privileged containers which are loading kernel modules.
> Currently we have to always pass around the contents of /lib/modules from the
> root namespace which contains the modules.
> (Also the containers need to have userspace components for moduleloading
> installed)
>
> The syscall would remove the need for this bookkeeping work.
So you want any container to have the ability to "bust through" the
containers and load a module from the "root" of the system?
That feels dangerous, why not just allow a mount of /lib/modules into
the containers that you want to be able to load a module?
Why are modules somehow "special" here, they are just a resource that
has to be allowed (or not) to be accessed by a container like anything
else on a filesystem.
thanks,
greg k-h
On 2021-09-15T18:02+0200, Greg KH wrote:
> On Wed, Sep 15, 2021 at 05:49:34PM +0200, Thomas Weißschuh wrote:
> > Hi,
> >
> > I would like to propose a new syscall that exposes the functionality of
> > request_module() to userspace.
> >
> > Proposed signature: request_module(char *module_name, char **args, int flags);
> > Where args and flags have to be NULL and 0 for the time being.
> >
> > Rationale:
> >
> > We are using nested, privileged containers which are loading kernel modules.
> > Currently we have to always pass around the contents of /lib/modules from the
> > root namespace which contains the modules.
> > (Also the containers need to have userspace components for moduleloading
> > installed)
> >
> > The syscall would remove the need for this bookkeeping work.
>
> So you want any container to have the ability to "bust through" the
> containers and load a module from the "root" of the system?
Only those with CAP_SYS_MODULE.
Having this capability would also allow them to load the module normally when
it is mounted in or potentially downloaded from the internet.
> That feels dangerous, why not just allow a mount of /lib/modules into
> the containers that you want to be able to load a module?
This is what we are currently doing. But sometimes this gets forgotten at some
point in the chain of nested containers/namespaces and things break.
> Why are modules somehow "special" here, they are just a resource that
> has to be allowed (or not) to be accessed by a container like anything
> else on a filesystem.
They are special insofar as they always have to match the running kernel,
which is managed by the root namespace.
The biggest problems would probably arise if the root namespace has non-standard
modules available which the container would normally not have access to.
I think this is a big potential problem, one that would not be justified by the
quality of life improvement.
Sorry for the noise.
Thomas
On Wed, Sep 15, 2021 at 8:50 AM Thomas Weißschuh <[email protected]> wrote:
>
> Hi,
>
> I would like to propose a new syscall that exposes the functionality of
> request_module() to userspace.
>
> Proposed signature: request_module(char *module_name, char **args, int flags);
> Where args and flags have to be NULL and 0 for the time being.
>
> Rationale:
>
> We are using nested, privileged containers which are loading kernel modules.
> Currently we have to always pass around the contents of /lib/modules from the
> root namespace which contains the modules.
> (Also the containers need to have userspace components for moduleloading
> installed)
>
> The syscall would remove the need for this bookkeeping work.
I feel like I'm missing something, and I don't understand the purpose
of this syscall. Wouldn't the right solution be for the container to
have a stub module loader (maybe doable with a special /sbin/modprobe
or maybe a kernel patch would be needed, depending on the exact use
case) and have the stub call out to the container manager to request
the module? The container manager would check its security policy and
load the module or not load it as appropriate.
--Andy
On Wed, Sep 15, 2021 at 09:47:25AM -0700, Andy Lutomirski wrote:
> On Wed, Sep 15, 2021 at 8:50 AM Thomas Weißschuh <[email protected]> wrote:
> >
> > Hi,
> >
> > I would like to propose a new syscall that exposes the functionality of
> > request_module() to userspace.
> >
> > Proposed signature: request_module(char *module_name, char **args, int flags);
> > Where args and flags have to be NULL and 0 for the time being.
> >
> > Rationale:
> >
> > We are using nested, privileged containers which are loading kernel modules.
> > Currently we have to always pass around the contents of /lib/modules from the
> > root namespace which contains the modules.
> > (Also the containers need to have userspace components for moduleloading
> > installed)
> >
> > The syscall would remove the need for this bookkeeping work.
>
> I feel like I'm missing something, and I don't understand the purpose
> of this syscall. Wouldn't the right solution be for the container to
> have a stub module loader (maybe doable with a special /sbin/modprobe
> or maybe a kernel patch would be needed, depending on the exact use
> case) and have the stub call out to the container manager to request
> the module? The container manager would check its security policy and
> load the module or not load it as appropriate.
I don't see the need for a syscall like this yet either.
This should be the job of the container manager. modprobe just calls the
init_module() syscall, right?
If so the seccomp notifier can be used to intercept this system call for
the container and verify the module against an allowlist similar to how
we currently handle mount.
Christian
On Thu, Sep 16, 2021, at 2:27 AM, Christian Brauner wrote:
> On Wed, Sep 15, 2021 at 09:47:25AM -0700, Andy Lutomirski wrote:
> > On Wed, Sep 15, 2021 at 8:50 AM Thomas Weißschuh <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > I would like to propose a new syscall that exposes the functionality of
> > > request_module() to userspace.
> > >
> > > Proposed signature: request_module(char *module_name, char **args, int flags);
> > > Where args and flags have to be NULL and 0 for the time being.
> > >
> > > Rationale:
> > >
> > > We are using nested, privileged containers which are loading kernel modules.
> > > Currently we have to always pass around the contents of /lib/modules from the
> > > root namespace which contains the modules.
> > > (Also the containers need to have userspace components for moduleloading
> > > installed)
> > >
> > > The syscall would remove the need for this bookkeeping work.
> >
> > I feel like I'm missing something, and I don't understand the purpose
> > of this syscall. Wouldn't the right solution be for the container to
> > have a stub module loader (maybe doable with a special /sbin/modprobe
> > or maybe a kernel patch would be needed, depending on the exact use
> > case) and have the stub call out to the container manager to request
> > the module? The container manager would check its security policy and
> > load the module or not load it as appropriate.
>
> I don't see the need for a syscall like this yet either.
>
> This should be the job of the container manager. modprobe just calls the
> init_module() syscall, right?
Not quite so simple. modprobe parses things in /lib/modules and maybe /etc to decide what init_module() calls to do.
But I admit I’m a bit confused. What exactly is the container doing that causes the container’s copy of modprobe to be called?
>
> If so the seccomp notifier can be used to intercept this system call for
> the container and verify the module against an allowlist similar to how
> we currently handle mount.
>
> Christian
>
On 2021-09-18T11:47-0700, Andy Lutomirski wrote:
> On Thu, Sep 16, 2021, at 2:27 AM, Christian Brauner wrote:
> > On Wed, Sep 15, 2021 at 09:47:25AM -0700, Andy Lutomirski wrote:
> > > On Wed, Sep 15, 2021 at 8:50 AM Thomas Weißschuh <[email protected]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I would like to propose a new syscall that exposes the functionality of
> > > > request_module() to userspace.
> > > >
> > > > Proposed signature: request_module(char *module_name, char **args, int flags);
> > > > Where args and flags have to be NULL and 0 for the time being.
> > > >
> > > > Rationale:
> > > >
> > > > We are using nested, privileged containers which are loading kernel modules.
> > > > Currently we have to always pass around the contents of /lib/modules from the
> > > > root namespace which contains the modules.
> > > > (Also the containers need to have userspace components for moduleloading
> > > > installed)
> > > >
> > > > The syscall would remove the need for this bookkeeping work.
> > >
> > > I feel like I'm missing something, and I don't understand the purpose
> > > of this syscall. Wouldn't the right solution be for the container to
> > > have a stub module loader (maybe doable with a special /sbin/modprobe
> > > or maybe a kernel patch would be needed, depending on the exact use
> > > case) and have the stub call out to the container manager to request
> > > the module? The container manager would check its security policy and
> > > load the module or not load it as appropriate.
> >
> > I don't see the need for a syscall like this yet either.
> >
> > This should be the job of the container manager. modprobe just calls the
> > init_module() syscall, right?
>
> Not quite so simple. modprobe parses things in /lib/modules and maybe /etc to decide what init_module() calls to do.
>
> But I admit I’m a bit confused. What exactly is the container doing that causes the container’s copy of modprobe to be called?
The container is running an instance of the docker daemon in swarm mode.
That needs the "ip_vs" module (amongst others) and explicitly tries to load it
via modprobe.
> > If so the seccomp notifier can be used to intercept this system call for
> > the container and verify the module against an allowlist similar to how
> > we currently handle mount.
> >
> > Christian
> >
On Sun, Sep 19, 2021 at 12:56 AM Thomas Weißschuh <[email protected]> wrote:
>
> On 2021-09-18T11:47-0700, Andy Lutomirski wrote:
> > On Thu, Sep 16, 2021, at 2:27 AM, Christian Brauner wrote:
> > > On Wed, Sep 15, 2021 at 09:47:25AM -0700, Andy Lutomirski wrote:
> > > > On Wed, Sep 15, 2021 at 8:50 AM Thomas Weißschuh <[email protected]> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I would like to propose a new syscall that exposes the functionality of
> > > > > request_module() to userspace.
> > > > >
> > > > > Proposed signature: request_module(char *module_name, char **args, int flags);
> > > > > Where args and flags have to be NULL and 0 for the time being.
> > > > >
> > > > > Rationale:
> > > > >
> > > > > We are using nested, privileged containers which are loading kernel modules.
> > > > > Currently we have to always pass around the contents of /lib/modules from the
> > > > > root namespace which contains the modules.
> > > > > (Also the containers need to have userspace components for moduleloading
> > > > > installed)
> > > > >
> > > > > The syscall would remove the need for this bookkeeping work.
> > > >
> > > > I feel like I'm missing something, and I don't understand the purpose
> > > > of this syscall. Wouldn't the right solution be for the container to
> > > > have a stub module loader (maybe doable with a special /sbin/modprobe
> > > > or maybe a kernel patch would be needed, depending on the exact use
> > > > case) and have the stub call out to the container manager to request
> > > > the module? The container manager would check its security policy and
> > > > load the module or not load it as appropriate.
> > >
> > > I don't see the need for a syscall like this yet either.
> > >
> > > This should be the job of the container manager. modprobe just calls the
> > > init_module() syscall, right?
> >
> > Not quite so simple. modprobe parses things in /lib/modules and maybe /etc to decide what init_module() calls to do.
> >
> > But I admit I’m a bit confused. What exactly is the container doing that causes the container’s copy of modprobe to be called?
>
> The container is running an instance of the docker daemon in swarm mode.
> That needs the "ip_vs" module (amongst others) and explicitly tries to load it
> via modprobe.
>
Do you mean it literally invokes /sbin/modprobe? If so, hooking this
at /sbin/modprobe and calling out to the container manager seems like
a decent solution.
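
As a rough illustration of the hooking idea above, a replacement /sbin/modprobe could forward the requested name to the container manager instead of touching /lib/modules. The sketch below is only that: the socket path, the one-line JSON request format, and the `ok` reply field are all assumptions made up for the example, not an existing interface, and option parsing is deliberately naive (options that take a value are not handled).

```python
import json
import socket

# Hypothetical path; no standard location exists for such a socket.
MANAGER_SOCKET = "/run/container-manager/modules.sock"


def build_request(argv):
    """Turn 'modprobe [options] name' into a one-line JSON request (assumed format)."""
    # Naive parsing: the first argument that is not an option is the module name.
    names = [arg for arg in argv[1:] if not arg.startswith("-")]
    if not names:
        raise SystemExit("usage: modprobe <module>")
    request = {"action": "load", "module": names[0]}
    return (json.dumps(request) + "\n").encode()


def request_module(payload, path=MANAGER_SOCKET):
    """Send the request to the manager and return its JSON reply as a dict."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(payload)
        reply = sock.makefile("r").readline()
    return json.loads(reply)
```

A wrapper script would call `request_module(build_request(sys.argv))` and exit non-zero when the reply does not carry `ok`. The manager stays the only party that ever touches host modules.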
> > > If so the seccomp notifier can be used to intercept this system call for
> > > the container and verify the module against an allowlist similar to how
> > > we currently handle mount.
> > >
> > > Christian
> > >
On Mon, Sep 20, 2021 at 04:51:19PM +0200, Thomas Weißschuh wrote:
> On 2021-09-19T07:37-0700, Andy Lutomirski wrote:
> > On Sun, Sep 19, 2021 at 12:56 AM Thomas Weißschuh <[email protected]> wrote:
> > >
> > > On 2021-09-18T11:47-0700, Andy Lutomirski wrote:
> > > > On Thu, Sep 16, 2021, at 2:27 AM, Christian Brauner wrote:
> > > > > On Wed, Sep 15, 2021 at 09:47:25AM -0700, Andy Lutomirski wrote:
> > > > > > On Wed, Sep 15, 2021 at 8:50 AM Thomas Weißschuh <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I would like to propose a new syscall that exposes the functionality of
> > > > > > > request_module() to userspace.
> > > > > > >
> > > > > > > Proposed signature: request_module(char *module_name, char **args, int flags);
> > > > > > > Where args and flags have to be NULL and 0 for the time being.
> > > > > > >
> > > > > > > Rationale:
> > > > > > >
> > > > > > > We are using nested, privileged containers which are loading kernel modules.
> > > > > > > Currently we have to always pass around the contents of /lib/modules from the
> > > > > > > root namespace which contains the modules.
> > > > > > > (Also the containers need to have userspace components for moduleloading
> > > > > > > installed)
> > > > > > >
> > > > > > > The syscall would remove the need for this bookkeeping work.
> > > > > >
> > > > > > I feel like I'm missing something, and I don't understand the purpose
> > > > > > of this syscall. Wouldn't the right solution be for the container to
> > > > > > have a stub module loader (maybe doable with a special /sbin/modprobe
> > > > > > or maybe a kernel patch would be needed, depending on the exact use
> > > > > > case) and have the stub call out to the container manager to request
> > > > > > the module? The container manager would check its security policy and
> > > > > > load the module or not load it as appropriate.
> > > > >
> > > > > I don't see the need for a syscall like this yet either.
> > > > >
> > > > > This should be the job of the container manager. modprobe just calls the
> > > > > init_module() syscall, right?
> > > >
> > > > Not quite so simple. modprobe parses things in /lib/modules and maybe /etc to decide what init_module() calls to do.
> > > >
> > > > But I admit I’m a bit confused. What exactly is the container doing that causes the container’s copy of modprobe to be called?
> > >
> > > The container is running an instance of the docker daemon in swarm mode.
> > > That needs the "ip_vs" module (amongst others) and explicitly tries to load it
> > > via modprobe.
> > >
> >
> > Do you mean it literally invokes /sbin/modprobe? If so, hooking this
> > at /sbin/modprobe and calling out to the container manager seems like
> > a decent solution.
>
> Yes it does. Thanks for the idea, I'll see how this works out.
Would documentation guiding you in that way have helped? If so
I welcome a patch that does just that.
Luis
On 2021-09-19T07:37-0700, Andy Lutomirski wrote:
> On Sun, Sep 19, 2021 at 12:56 AM Thomas Weißschuh <[email protected]> wrote:
> >
> > On 2021-09-18T11:47-0700, Andy Lutomirski wrote:
> > > On Thu, Sep 16, 2021, at 2:27 AM, Christian Brauner wrote:
> > > > On Wed, Sep 15, 2021 at 09:47:25AM -0700, Andy Lutomirski wrote:
> > > > > On Wed, Sep 15, 2021 at 8:50 AM Thomas Weißschuh <[email protected]> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I would like to propose a new syscall that exposes the functionality of
> > > > > > request_module() to userspace.
> > > > > >
> > > > > > Proposed signature: request_module(char *module_name, char **args, int flags);
> > > > > > Where args and flags have to be NULL and 0 for the time being.
> > > > > >
> > > > > > Rationale:
> > > > > >
> > > > > > We are using nested, privileged containers which are loading kernel modules.
> > > > > > Currently we have to always pass around the contents of /lib/modules from the
> > > > > > root namespace which contains the modules.
> > > > > > (Also the containers need to have userspace components for moduleloading
> > > > > > installed)
> > > > > >
> > > > > > The syscall would remove the need for this bookkeeping work.
> > > > >
> > > > > I feel like I'm missing something, and I don't understand the purpose
> > > > > of this syscall. Wouldn't the right solution be for the container to
> > > > > have a stub module loader (maybe doable with a special /sbin/modprobe
> > > > > or maybe a kernel patch would be needed, depending on the exact use
> > > > > case) and have the stub call out to the container manager to request
> > > > > the module? The container manager would check its security policy and
> > > > > load the module or not load it as appropriate.
> > > >
> > > > I don't see the need for a syscall like this yet either.
> > > >
> > > > This should be the job of the container manager. modprobe just calls the
> > > > init_module() syscall, right?
> > >
> > > Not quite so simple. modprobe parses things in /lib/modules and maybe /etc to decide what init_module() calls to do.
> > >
> > > But I admit I’m a bit confused. What exactly is the container doing that causes the container’s copy of modprobe to be called?
> >
> > The container is running an instance of the docker daemon in swarm mode.
> > That needs the "ip_vs" module (amongst others) and explicitly tries to load it
> > via modprobe.
> >
>
> Do you mean it literally invokes /sbin/modprobe? If so, hooking this
> at /sbin/modprobe and calling out to the container manager seems like
> a decent solution.
Yes it does. Thanks for the idea, I'll see how this works out.
> > > > If so the seccomp notifier can be used to intercept this system call for
> > > > the container and verify the module against an allowlist similar to how
> > > > we currently handle mount.
> > > >
> > > > Christian
> > > >
On Mon, Sep 20, 2021 at 11:16 AM Luis Chamberlain <[email protected]> wrote:
>
> On Mon, Sep 20, 2021 at 04:51:19PM +0200, Thomas Weißschuh wrote:
> > > Do you mean it literally invokes /sbin/modprobe? If so, hooking this
> > > at /sbin/modprobe and calling out to the container manager seems like
> > > a decent solution.
> >
> > Yes it does. Thanks for the idea, I'll see how this works out.
>
> Would documentation guiding you in that way have helped? If so
> I welcome a patch that does just that.
If someone wants to make this classy, we should probably have the
container counterpart of a standardized paravirt interface. There
should be a way for a container to, in a runtime-agnostic way, issue
requests to its manager, and requesting a module by (name, Linux
kernel version for which that name makes sense) seems like an
excellent use of such an interface.
--Andy
>
> Luis
On Mon, Sep 20, 2021 at 11:36:47AM -0700, Andy Lutomirski wrote:
> On Mon, Sep 20, 2021 at 11:16 AM Luis Chamberlain <[email protected]> wrote:
> >
> > On Mon, Sep 20, 2021 at 04:51:19PM +0200, Thomas Weißschuh wrote:
>
> > > > Do you mean it literally invokes /sbin/modprobe? If so, hooking this
> > > > at /sbin/modprobe and calling out to the container manager seems like
> > > > a decent solution.
> > >
> > > Yes it does. Thanks for the idea, I'll see how this works out.
> >
> > Would documentation guiding you in that way have helped? If so
> > I welcome a patch that does just that.
>
> If someone wants to make this classy, we should probably have the
> container counterpart of a standardized paravirt interface. There
> should be a way for a container to, in a runtime-agnostic way, issue
> requests to its manager, and requesting a module by (name, Linux
> kernel version for which that name makes sense) seems like an
> excellent use of such an interface.
I have always thought of this in terms of the two ways we currently do it:
1. Caller-transparent container manager requests.
This is the seccomp notifier, where we transparently handle syscalls,
including intercepting init_module(): we parse the module to be loaded
out of the container's syscall args and, if it is allow-listed, load it
for the container. Otherwise we either continue the syscall and let it
fail, or fail it directly through the seccomp return value.
2. A process in the container explicitly calling out to the container
manager.
One example of how this happens is systemd-nspawn, which uses dbus messages
between systemd in the container and systemd outside the container to,
e.g., allocate a new terminal in the container (kinda insecure, but
that's another issue) or do other things.
So what was your idea: would it be like a device file that could be
exposed to the container, where it writes requests to the container
manager? What would be the advantage over just standardizing a socket
protocol, which is what we do, for example (it doesn't do module loading,
of course, as we handle that differently):
## Container to host communication
LXD sets up a socket at `/dev/lxd/sock` which root in the container can
use to communicate with LXD on the host.
In LXD, this feature is implemented through a /dev/lxd/sock node which
is created and set up for all LXD instances.
This file is a Unix socket which processes inside the instance can
connect to. It's multi-threaded so multiple clients can be connected at
the same time.
Implementation details
LXD on the host binds /var/lib/lxd/devlxd/sock and starts listening for
new connections on it.
This socket is then exposed into every single instance started by LXD at
/dev/lxd/sock.
The single socket is required so we can exceed 4096 instances,
otherwise, LXD would have to bind a different socket for every instance,
quickly reaching the FD limit.
Authentication
Queries on /dev/lxd/sock will only return information related to the
requesting instance. To figure out where a request comes from, LXD will
extract the initial socket ucred and compare that to the list of
instances it manages.
Protocol
The protocol on /dev/lxd/sock is plain-text HTTP with JSON messaging, so
very similar to the local version of the LXD protocol.
Unlike the main LXD API, there is no background operation and no
authentication support in the /dev/lxd/sock API.
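
For illustration, talking plain-text HTTP with JSON over such a Unix socket needs only a small client. This is a minimal sketch using Python's standard library; the class and helper names are invented here, and the endpoint in the usage note is an assumption, not something taken from the description above.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that connects to a Unix socket path instead of TCP."""

    def __init__(self, path, timeout=5.0):
        # The host name is only used for the Host: header.
        super().__init__("localhost", timeout=timeout)
        self._path = path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self._path)
        self.sock = sock


def get_json(path, url):
    """Issue a GET over the Unix socket and return (status, decoded JSON body)."""
    conn = UnixHTTPConnection(path)
    try:
        conn.request("GET", url)
        resp = conn.getresponse()
        return resp.status, json.loads(resp.read())
    finally:
        conn.close()
```

Inside an instance this might be called as `get_json("/dev/lxd/sock", "/1.0")`, assuming the socket serves a `/1.0` endpoint.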
Christian
On Wed, Sep 22, 2021, at 5:25 AM, Christian Brauner wrote:
> On Mon, Sep 20, 2021 at 11:36:47AM -0700, Andy Lutomirski wrote:
>> On Mon, Sep 20, 2021 at 11:16 AM Luis Chamberlain <[email protected]> wrote:
>> >
>> > On Mon, Sep 20, 2021 at 04:51:19PM +0200, Thomas Weißschuh wrote:
>>
>> > > > Do you mean it literally invokes /sbin/modprobe? If so, hooking this
>> > > > at /sbin/modprobe and calling out to the container manager seems like
>> > > > a decent solution.
>> > >
>> > > Yes it does. Thanks for the idea, I'll see how this works out.
>> >
>> > Would documentation guiding you in that way have helped? If so
>> > I welcome a patch that does just that.
>>
>> If someone wants to make this classy, we should probably have the
>> container counterpart of a standardized paravirt interface. There
>> should be a way for a container to, in a runtime-agnostic way, issue
>> requests to its manager, and requesting a module by (name, Linux
>> kernel version for which that name makes sense) seems like an
>> excellent use of such an interface.
>
> I always thought of this in two ways we currently do this:
>
> 1. Caller transparent container manager requests.
> This is the seccomp notifier where we transparently handle syscalls
> including intercepting init_module() where we parse out the module to
> be loaded from the syscall args of the container and if it is
> allow-listed load it for the container otherwise continue the syscall
> letting it fail or failing directly through seccomp return value.
Specific problems here include aliases and dependencies. My modules.alias file, for example, has:
alias net-pf-16-proto-16-family-wireguard wireguard
If I do modprobe net-pf-16-proto-16-family-wireguard, modprobe parses some files in /lib/modules/`uname -r` and issues init_module() asking for 'wireguard'. So hooking init_module() is at the wrong layer -- for that to work, the container's /sbin/modprobe needs to already have figured out that the desired module is wireguard and have a .ko for it.
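
To make the layering concrete, here is a minimal sketch of the alias step that happens before any init_module() call. It only handles `alias <pattern> <module>` lines and shell-style wildcards; real modprobe does considerably more (dependencies, configuration under /etc, treating "-" and "_" as equivalent), and the function names are made up for the example.

```python
import fnmatch


def parse_aliases(text):
    """Collect (pattern, module) pairs from modules.alias-style text."""
    aliases = []
    for line in text.splitlines():
        fields = line.split()
        # Only lines of the form 'alias <pattern> <module>' are considered.
        if len(fields) == 3 and fields[0] == "alias":
            aliases.append((fields[1], fields[2]))
    return aliases


def resolve(name, aliases):
    """Return the module names whose alias pattern matches, else the name itself."""
    matches = [module for pattern, module in aliases
               if fnmatch.fnmatchcase(name, pattern)]
    return matches or [name]
```

With the alias line quoted above, `resolve("net-pf-16-proto-16-family-wireguard", aliases)` yields `["wireguard"]`, and that is the only name a hook at the init_module() layer would ever see.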
>
> 2. A process in the container explicitly calling out to the container
> manager.
> One example how this happens is systemd-nspawn via dbus messages
> between systemd in the container and systemd outside the container to
> e.g. allocate a new terminal in the container (kinda insecure but
> that's another issue) or other stuff.
>
> So what was your idea: would it be like a device file that could be
> exposed to the container where it writes requests to the container
> manager? What would be the advantage to just standardizing a socket
> protocol which is what we do for example (it doesn't do module loading
> of course as we handle that differently):
My idea is standardizing *something*. I think it would be nice if, for example, distros could ship a /sbin/modprobe that would do the right thing inside any compliant container runtime as well as when running outside a container.
I suppose container managers could also bind-mount over /sbin/modprobe, but that's more intrusive.
On Wed, Sep 22, 2021 at 08:34:23AM -0700, Andy Lutomirski wrote:
> On Wed, Sep 22, 2021, at 5:25 AM, Christian Brauner wrote:
> > On Mon, Sep 20, 2021 at 11:36:47AM -0700, Andy Lutomirski wrote:
> >> On Mon, Sep 20, 2021 at 11:16 AM Luis Chamberlain <[email protected]> wrote:
> >> >
> >> > On Mon, Sep 20, 2021 at 04:51:19PM +0200, Thomas Weißschuh wrote:
> >>
> >> > > > Do you mean it literally invokes /sbin/modprobe? If so, hooking this
> >> > > > at /sbin/modprobe and calling out to the container manager seems like
> >> > > > a decent solution.
> >> > >
> >> > > Yes it does. Thanks for the idea, I'll see how this works out.
> >> >
> >> > Would documentation guiding you in that way have helped? If so
> >> > I welcome a patch that does just that.
> >>
> >> If someone wants to make this classy, we should probably have the
> >> container counterpart of a standardized paravirt interface. There
> >> should be a way for a container to, in a runtime-agnostic way, issue
> >> requests to its manager, and requesting a module by (name, Linux
> >> kernel version for which that name makes sense) seems like an
> >> excellent use of such an interface.
> >
> > I always thought of this in two ways we currently do this:
> >
> > 1. Caller transparent container manager requests.
> > This is the seccomp notifier where we transparently handle syscalls
> > including intercepting init_module() where we parse out the module to
> > be loaded from the syscall args of the container and if it is
> > allow-listed load it for the container otherwise continue the syscall
> > letting it fail or failing directly through seccomp return value.
>
> Specific problems here include aliases and dependencies. My modules.alias file, for example, has:
>
> alias net-pf-16-proto-16-family-wireguard wireguard
>
> If I do modprobe net-pf-16-proto-16-family-wireguard, modprobe parses some files in /lib/modules/`uname -r` and issues init_module() asking for 'wireguard'. So hooking init_module() is at the wrong layer -- for that to work, the container's /sbin/modprobe needs to already have figured out that the desired module is wireguard and have a .ko for it.
You can't use the container's .ko module. For this you would need to
trust the image that the container wants you to load. The container
manager should always load a host module.
>
> >
> > 2. A process in the container explicitly calling out to the container
> > manager.
> > One example how this happens is systemd-nspawn via dbus messages
> > between systemd in the container and systemd outside the container to
> > e.g. allocate a new terminal in the container (kinda insecure but
> > that's another issue) or other stuff.
> >
> > So what was your idea: would it be like a device file that could be
> > exposed to the container where it writes requests to the container
> > manager? What would be the advantage to just standardizing a socket
> > protocol which is what we do for example (it doesn't do module loading
> > of course as we handle that differently):
>
> My idea is standardizing *something*. I think it would be nice if, for example, distros could ship a /sbin/modprobe that would do the right thing inside any compliant container runtime as well as when running outside a container.
>
> I suppose container managers could also bind-mount over /sbin/modprobe, but that's more intrusive.
I don't see this as a big issue because that is fairly trivial.
I think we never want to trust the container's modules.
What probably should be happening is that the manager exposes a list of
modules the container can request in some form. We have precedent for
doing something like this.
So now modprobe and similar tools can be made aware that if they are in
a container they should request that module from the container manager
be it via a socket request or something else.
Nesting will be a bit funny but can probably be made to work by just
bind-mounting the outermost socket into the container or relaying the
request.
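
A rough sketch of how the manager side of this could look, including the relaying variant for nesting. Everything here is hypothetical: the request fields (`action`, `module`), the reply shape, and the `forward` callback are invented for the example. `forward` stands for "actually load the host module" on the outermost manager and for "pass the request to the parent manager" on a nested one.

```python
def handle_request(req, allowed, forward):
    """Answer one request from a container.

    req:     dict such as {"action": "load", "module": "ip_vs"} (assumed format)
    allowed: set of module names this manager lets its containers request
    forward: callable taking the request; loads the module on the host,
             or relays the request to the parent manager when nested
    """
    action = req.get("action")
    if action == "list":
        # Expose the list of modules the container may request.
        return {"ok": True, "modules": sorted(allowed)}
    if action == "load":
        if req.get("module") not in allowed:
            return {"ok": False, "error": "module not allowed"}
        return forward(req)
    return {"ok": False, "error": "unknown action"}
```

Nesting then falls out of composition: an inner manager passes the outer manager's handler as its `forward`, so a request only succeeds if every level's list permits it.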
On Wed, Sep 22, 2021, at 8:52 AM, Christian Brauner wrote:
> On Wed, Sep 22, 2021 at 08:34:23AM -0700, Andy Lutomirski wrote:
>> On Wed, Sep 22, 2021, at 5:25 AM, Christian Brauner wrote:
>> > On Mon, Sep 20, 2021 at 11:36:47AM -0700, Andy Lutomirski wrote:
>> >> On Mon, Sep 20, 2021 at 11:16 AM Luis Chamberlain <[email protected]> wrote:
>> >> >
>> >> > On Mon, Sep 20, 2021 at 04:51:19PM +0200, Thomas Weißschuh wrote:
>> >>
>> >> > > > Do you mean it literally invokes /sbin/modprobe? If so, hooking this
>> >> > > > at /sbin/modprobe and calling out to the container manager seems like
>> >> > > > a decent solution.
>> >> > >
>> >> > > Yes it does. Thanks for the idea, I'll see how this works out.
>> >> >
>> >> > Would documentation guiding you in that way have helped? If so
>> >> > I welcome a patch that does just that.
>> >>
>> >> If someone wants to make this classy, we should probably have the
>> >> container counterpart of a standardized paravirt interface. There
>> >> should be a way for a container to, in a runtime-agnostic way, issue
>> >> requests to its manager, and requesting a module by (name, Linux
>> >> kernel version for which that name makes sense) seems like an
>> >> excellent use of such an interface.
>> >
>> > I always thought of this in two ways we currently do this:
>> >
>> > 1. Caller-transparent container manager requests.
>> > This is the seccomp notifier, where we transparently handle syscalls,
>> > including intercepting init_module(). We parse the module to be
>> > loaded out of the container's syscall args and, if it is
>> > allow-listed, load it for the container. Otherwise we either continue
>> > the syscall and let it fail, or fail it directly through the seccomp
>> > return value.
>>
>> Specific problems here include aliases and dependencies. My modules.alias file, for example, has:
>>
>> alias net-pf-16-proto-16-family-wireguard wireguard
>>
>> If I do modprobe net-pf-16-proto-16-family-wireguard, modprobe parses some files in /lib/modules/`uname -r` and issues init_module() asking for 'wireguard'. So hooking init_module() is at the wrong layer -- for that to work, the container's /sbin/modprobe needs to already have figured out that the desired module is wireguard and have a .ko for it.
>
> You can't use the container's .ko module. For this you would need to
> trust the image that the container wants you to load. The container
> manager should always load a host module.
>
Agreed.
>>
>> >
>> > 2. A process in the container explicitly calling out to the container
>> > manager.
>> > One example of how this happens is systemd-nspawn via dbus messages
>> > between systemd in the container and systemd outside the container to
>> > e.g. allocate a new terminal in the container (kinda insecure but
>> > that's another issue) or other stuff.
>> >
>> > So what was your idea: would it be like a device file that could be
>> > exposed to the container where it writes requests to the container
>> > manager? What would be the advantage to just standardizing a socket
>> > protocol which is what we do for example (it doesn't do module loading
>> > of course as we handle that differently):
>>
>> My idea is standardizing *something*. I think it would be nice if, for example, distros could ship a /sbin/modprobe that would do the right thing inside any compliant container runtime as well as when running outside a container.
>>
>> I suppose container managers could also bind-mount over /sbin/modprobe, but that's more intrusive.
>
> I don't see this as a big issue because that is fairly trivial.
> I think we never want to trust the container's modules.
> What probably should be happening is that the manager exposes a list of
> modules the container can request in some form. We have precedent for
> doing something like this.
> So now modprobe and similar tools can be made aware that if they are in
> a container they should request that module from the container manager
> be it via a socket request or something else.
> Nesting will be a bit funny but can probably be made to work by just
> bind-mounting the outermost socket into the container or relaying the
> request.
Why bother with a list? I think it should be sufficient for the container to ask for a module and either get it or not get it.
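
As a rough sketch of what "ask and either get it or not" could mean on the
manager side: the manager resolves the requested name against the host's
alias table (as in the wireguard example quoted above) and then answers
yes or no. The function names and the allowed set are hypothetical, and
the matching is simplified: it takes the first matching alias and treats
alias patterns as shell-style globs.

```python
from fnmatch import fnmatchcase


def parse_aliases(text):
    """Parse modules.alias-style lines into (pattern, module) pairs."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        # Only 'alias <pattern> <module>' lines are of interest here.
        if len(fields) == 3 and fields[0] == "alias":
            pairs.append((fields[1], fields[2]))
    return pairs


def decide(requested, aliases, allowed):
    """Return the host module to load for a request, or None to refuse."""
    for pattern, module in aliases:
        if fnmatchcase(requested, pattern):
            return module if module in allowed else None
    # No alias matched: treat the request as a plain module name,
    # with '-' and '_' considered equivalent.
    name = requested.replace("-", "_")
    return name if name in allowed else None
```

The container never learns what is available; it only sees whether its
request succeeded.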
On Wed, Sep 22, 2021 at 01:06:49PM -0700, Andy Lutomirski wrote:
> Why bother with a list? I think it should be sufficient for the container to ask for a module and either get it or not get it.
I just meant that the programs in the container can see the modules
available on the host. Simplest thing could be bind-mounting in the
host's module folder with suitable protection (locked read-only mount).
But yeah, it can likely be as simple as allowing it to ask for a module
and not bother telling it about what is available.
On 9/24/21 06:19, Christian Brauner wrote:
> On Wed, Sep 22, 2021 at 01:06:49PM -0700, Andy Lutomirski wrote:
> I just meant that the programs in the container can see the modules
> available on the host. Simplest thing could be bind-mounting in the
> host's module folder with suitable protection (locked read-only mount).
> But yeah, it can likely be as simple as allowing it to ask for a module
> and not bother telling it about what is available.
>
If the container gets to see host modules, interesting races will
result when containers are migrated CRIU-style.
On 2021-09-19 09:56+0200, Thomas Weißschuh wrote:
> On 2021-09-18T11:47-0700, Andy Lutomirski wrote:
> > But I admit I’m a bit confused. What exactly is the container doing that causes the container’s copy of modprobe to be called?
>
> The container is running an instance of the docker daemon in swarm mode.
> That needs the "ip_vs" module (amongst others) and explicitly tries to load it
> via modprobe.
If somebody stumbles upon this specific issue:
The "ip_vs" module will be autoloaded in future kernel versions with
https://lore.kernel.org/lkml/[email protected]/
applied.
> > > If so the seccomp notifier can be used to intercept this system call for
> > > the container and verify the module against an allowlist similar to how
> > > we currently handle mount.
> > >
> > > Christian