2008-06-20 03:11:34

by Eric W. Biederman

Subject: Re: [patch -mm 0/4] mqueue namespace

Cedric Le Goater <[email protected]> writes:

> Hello !
>
> Here's a small patchset introducing a new namespace for POSIX
> message queues.
>
> Nothing really complex apart from the mqueue filesystem which
> needed some special care

This looks stalled. I have a brainstorm that might take a totally
different perspective on things.

The only reason we don't just allow multiple mounts of mqueuefs to
solve this problem is because there is a kernel syscall on the path.

If we just hard coded a mount point into the kernel and required user
space to always mount mqueuefs there the problem would be solved.

Hard coding a mount point unfortunately violates the unix rule
of separating mechanism and policy.

One way to fix that is to add a hidden directory to the mnt namespace.
Where magic in kernel filesystems can be mounted. Only visible
with a magic openat flag. Then:

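/* AT_FDKERN is the proposed magic flag naming the hidden directory */
fd = openat(AT_FDKERN, ".", O_DIRECTORY);
fchdir(fd);
/* drop the shared instance, then mount a fresh one */
umount2("./mqueue", MNT_DETACH);
mount("none", "./mqueue", "mqueue", 0, NULL);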

Would unshare the mqueue namespace.
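
A slightly fuller sketch of the sequence above, with error handling added.
It rests on assumptions: AT_FDKERN does not exist in Linux, so the
hypothetical helper remount_mqueue() takes the hidden directory as an
already-open fd instead of opening it. umount2() is used because umount()
takes no flags argument. Whether the new mount is really a separate set of
queues depends on the proposal being implemented; as noted above, multiple
independent mounts of mqueuefs are not allowed today.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mount.h>
#include <unistd.h>

/*
 * Replace the mqueue mount inside the directory referred to by kern_dirfd.
 * kern_dirfd stands in for the hidden per-mount-namespace directory that
 * the proposed openat(AT_FDKERN, ...) call would return (hypothetical).
 * Side effect: changes the calling process's working directory.
 * Returns 0 on success or a negative errno value on failure.
 */
int remount_mqueue(int kern_dirfd)
{
    if (fchdir(kern_dirfd) < 0)
        return -errno;

    /* Lazily detach the old instance; EINVAL means nothing was mounted. */
    if (umount2("./mqueue", MNT_DETACH) < 0 && errno != EINVAL)
        return -errno;

    /* Under the proposal, this mount would be a new, empty set of queues. */
    if (mount("none", "./mqueue", "mqueue", 0, NULL) < 0)
        return -errno;

    return 0;
}
```

Both mount calls need CAP_SYS_ADMIN, so an unprivileged caller should
expect -EPERM back.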

Implemented for plan9 this would solve a problem of how do you get
access to all of its special filesystems. As only bind mounts
and remote filesystem mounts are available. For linux thinking about
it might shake the conversation up a bit.

Eric


2008-06-20 03:41:18

by Eric W. Biederman

Subject: Re: [patch -mm 0/4] mqueue namespace

[email protected] (Eric W. Biederman) writes:

> One way to fix that is to add a hidden directory to the mnt namespace.
> Where magic in kernel filesystems can be mounted. Only visible
> with a magic openat flag. Then:
>
> fd = openat(AT_FDKERN, ".", O_DIRECTORY);
> fchdir(fd);
> umount2("./mqueue", MNT_DETACH);
> mount("none", "./mqueue", "mqueue", 0, NULL);
>
> Would unshare the mqueue namespace.
>
> Implemented for plan9 this would solve a problem of how do you get
> access to all of its special filesystems. As only bind mounts
> and remote filesystem mounts are available. For linux thinking about
> it might shake the conversation up a bit.

Thinking about this some more. What is especially attractive if we do
all namespaces this way is that it solves two lurking problems.
1) How do you keep a namespace around without a process in it.
2) How do you enter a container.

If we could land the namespaces in the filesystem we could easily
persist them past the point where a process is present in one if we so
choose.

Entering a container would be a matter of replacing your current
namespace mounts with namespace mounts taken from the filesystem.

I expect performance would degrade in practice, but it is tempting
to implement it and run a benchmark and see if we can measure anything.
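
The message above proposes no concrete interface. For comparison only, and
from outside this thread: later kernels (Linux 3.0 onwards) expose each
namespace as a file under /proc/<pid>/ns/, a bind mount of such a file keeps
the namespace alive with no process in it, and setns(2) joins the namespace
behind an open fd. A minimal sketch of "entering" by path, where the helper
name enter_namespace() is hypothetical:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/*
 * Join the namespace named by a filesystem path, e.g. "/proc/<pid>/ns/ipc"
 * or a bind mount of such a file. nstype may be 0 (any type) or a
 * CLONE_NEW* constant such as CLONE_NEWIPC to insist on one kind.
 * Returns 0 on success or a negative errno value on failure.
 */
int enter_namespace(const char *ns_path, int nstype)
{
    int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -errno;

    /* Capture the result before close() can disturb errno. */
    int ret = setns(fd, nstype) < 0 ? -errno : 0;

    close(fd);
    return ret;
}
```

Joining most namespace types needs CAP_SYS_ADMIN, so an unprivileged caller
will typically see -EPERM.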

Eric

2008-06-20 14:50:43

by Serge E. Hallyn

Subject: Re: [patch -mm 0/4] mqueue namespace

Quoting Eric W. Biederman ([email protected]):
> Cedric Le Goater <[email protected]> writes:
>
> > Hello !
> >
> > Here's a small patchset introducing a new namespace for POSIX
> > message queues.
> >
> > Nothing really complex apart from the mqueue filesystem which
> > needed some special care
>
> This looks stalled.

It actually isn't really - Cedric had resent it a few weeks ago but had
troubles with the mail server so it never hit the lists. I think Dave
made a few more changes from there and was getting ready to resend
again. Dave?

> I have a brainstorm that might take a totally
> different perspective on things.
>
> The only reason we don't just allow multiple mounts of mqueuefs to
> solve this problem is because there is a kernel syscall on the path.
>
> If we just hard coded a mount point into the kernel and required user
> space to always mount mqueuefs there the problem would be solved.
>
> Hard coding a mount point unfortunately violates the unix rule
> of separating mechanism and policy.
>
> One way to fix that is to add a hidden directory to the mnt namespace.
> Where magic in kernel filesystems can be mounted. Only visible
> with a magic openat flag. Then:
>
> fd = openat(AT_FDKERN, ".", O_DIRECTORY);
> fchdir(fd);
> umount2("./mqueue", MNT_DETACH);
> mount("none", "./mqueue", "mqueue", 0, NULL);
>
> Would unshare the mqueue namespace.
>
> Implemented for plan9 this would solve a problem of how do you get
> access to all of its special filesystems. As only bind mounts
> and remote filesystem mounts are available. For linux thinking about
> it might shake the conversation up a bit.

It is unfortunate that two actions are needed to properly complete the
unshare, and we had definitely talked about just using the mount before.
I forget why we decided it wasn't practical, so maybe what you describe
solves it...

But at least the current patch reuses CLONE_NEWIPC for posix ipc, which
also seems to make sense.

-serge

2008-06-20 14:53:35

by Serge E. Hallyn

Subject: Re: [patch -mm 0/4] mqueue namespace

Quoting Eric W. Biederman ([email protected]):
> [email protected] (Eric W. Biederman) writes:
>
> > One way to fix that is to add a hidden directory to the mnt namespace.
> > Where magic in kernel filesystems can be mounted. Only visible
> > with a magic openat flag. Then:
> >
> > fd = openat(AT_FDKERN, ".", O_DIRECTORY);
> > fchdir(fd);
> > umount2("./mqueue", MNT_DETACH);
> > mount("none", "./mqueue", "mqueue", 0, NULL);
> >
> > Would unshare the mqueue namespace.
> >
> > Implemented for plan9 this would solve a problem of how do you get
> > access to all of its special filesystems. As only bind mounts
> > and remote filesystem mounts are available. For linux thinking about
> > it might shake the conversation up a bit.
>
> Thinking about this some more. What is especially attractive if we do
> all namespaces this way is that it solves two lurking problems.
> 1) How do you keep a namespace around without a process in it.
> 2) How do you enter a container.
>
> If we could land the namespaces in the filesystem we could easily
> persist them past the point where a process is present in one if we so
> choose.
>
> Entering a container would be a matter of replacing your current
> namespace mounts with namespace mounts taken from the filesystem.
>
> I expect performance would degrade in practice, but it is tempting
> to implement it and run a benchmark and see if we can measure anything.

The device ns could be a mount of an fs with the devices created in it,
while mknod becomes a symlink from that fs. And once a network
namespace is a filesystem, we can aim for the plan9 NAT solution of
mounting a remote /net onto ours. Neat.

But bye-bye posix?

-serge

2008-06-20 19:21:23

by Eric W. Biederman

Subject: Re: [patch -mm 0/4] mqueue namespace

"Serge E. Hallyn" <[email protected]> writes:

>
> It is unfortunate that two actions are needed to properly complete the
> unshare, and we had definitely talked about just using the mount before.
> I forget why we decided it wasn't practical, so maybe what you describe
> solves it...

What is worse, and I don't see a way around it, is that we don't have
any callbacks to check where things are mounted. So we can't ensure the
proper kind of filesystem is mounted in the right place.

That is, there is too much freedom in the mount APIs to allow for reliable
operation.

> But at least the current patch reuses CLONE_NEWIPC for posix ipc, which
> also seems to make sense.

Sort of. I'm really annoyed with whoever did the posix mqueue support:
adding the magic syscall that has to know the internal mount, instead of
requiring the thing be mounted somewhere and just rejecting file descriptors
for the wrong sorts of files.
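
The "reject the wrong sort of file" check can be illustrated from user
space. This is only a sketch under assumptions: the helper names
fd_is_mqueue() and path_is_mqueue() are hypothetical, the magic number is
the value listed for mqueue in statfs(2), and a kernel-side check would of
course look different.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/vfs.h>
#include <unistd.h>

/* Filesystem magic for mqueue as listed in statfs(2); defined here in case
 * the installed headers do not provide it. */
#ifndef MQUEUE_MAGIC
#define MQUEUE_MAGIC 0x19800202
#endif

/* Returns 1 if fd lives on an mqueue filesystem, 0 if not, -errno on error. */
int fd_is_mqueue(int fd)
{
    struct statfs sfs;

    if (fstatfs(fd, &sfs) < 0)
        return -errno;
    return sfs.f_type == MQUEUE_MAGIC;
}

/* Convenience wrapper: open a path read-only and apply the same check. */
int path_is_mqueue(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;

    int ret = fd_is_mqueue(fd);
    close(fd);
    return ret;
}
```

A caller would refuse to proceed unless the check returns 1, for example
on a file opened under wherever mqueue happens to be mounted.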

Eric

2008-08-29 09:47:48

by Cédric Le Goater

Subject: Re: [patch -mm 0/4] mqueue namespace

Eric W. Biederman wrote:
> [email protected] (Eric W. Biederman) writes:
>
>> One way to fix that is to add a hidden directory to the mnt namespace.
>> Where magic in kernel filesystems can be mounted. Only visible
>> with a magic openat flag. Then:
>>
>> fd = openat(AT_FDKERN, ".", O_DIRECTORY);
>> fchdir(fd);
>> umount2("./mqueue", MNT_DETACH);
>> mount("none", "./mqueue", "mqueue", 0, NULL);
>>
>> Would unshare the mqueue namespace.
>>
>> Implemented for plan9 this would solve a problem of how do you get
>> access to all of its special filesystems. As only bind mounts
>> and remote filesystem mounts are available. For linux thinking about
>> it might shake the conversation up a bit.
>
> Thinking about this some more. What is especially attractive if we do
> all namespaces this way is that it solves two lurking problems.
> 1) How do you keep a namespace around without a process in it.
> 2) How do you enter a container.
>
> If we could land the namespaces in the filesystem we could easily
> persist them past the point where a process is present in one if we so
> choose.
>
> Entering a container would be a matter of replacing your current
> namespace mounts with namespace mounts taken from the filesystem.
>
> I expect performance would degrade in practice, but it is tempting
> to implement it and run a benchmark and see if we can measure anything.

http://wiki.openvz.org/Containers/Mini-summit_2008_notes

You seem to have talked about this idea at the summit, but the notes
are a bit short on the "entering a container" topic. Have you had time to
work on the POC the notes are talking about?

The mqueue namespace (and sysv ipc) is typically one of these namespaces
with valid objects which can have no processes in it.

C.