2007-11-01 14:43:31

by Ingo Molnar

[permalink] [raw]
Subject: [patch] PID namespace design bug, workaround


while checking recent commits to the kernel core i took a look at the
PID namespaces implementation, and it has a fatal flaw: it breaks
futexes and various libraries (and other stuff) that use PIDs as the
means of identifying tasks, by not providing any means of global
identification that works across PID namespaces. (PIDs _are_ a very
convenient and global way of identifying contexts.)

i asked Ulrich about this and it turns out he has warned about this
early on:

http://www.nabble.com/Re%3A-question%3A-pid-space-semantics.-p3409990.html

but this problem is still present in the code, and it has been recently
committed into mainline via:

commit 30e49c263e36341b60b735cbef5ca37912549264
Author: Pavel Emelyanov <[email protected]>
Date: Thu Oct 18 23:40:10 2007 -0700

pid namespaces: allow cloning of new namespace

without these problems having been resolved. A full-scale revert is
probably too intrusive, but at minimum we need to turn off user-space
access to this feature via this simple patch. Until this issue is
resolved properly the new PID namespace code needs to be turned off.
Letting this into 2.6.24 would be a disaster.

Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/fork.c | 8 ++++++++
1 file changed, 8 insertions(+)

Index: v/kernel/fork.c
===================================================================
--- v.orig/kernel/fork.c
+++ v/kernel/fork.c
@@ -1420,6 +1420,14 @@ long do_fork(unsigned long clone_flags,
int trace = 0;
long nr;

+ /*
+ * PID namespaces are broken at the moment: they do not allow
+ * certain PID based syscalls (such as futexes) to be used
+ * across namespaces. This is broken and must not be allowed,
+ * so we keep this feature turned off until it's properly fixed.
+ */
+ clone_flags &= ~CLONE_NEWPID;
+
if (unlikely(current->ptrace)) {
trace = fork_traceflag (clone_flags);
if (trace)


2007-11-01 14:52:26

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

Ingo Molnar wrote:
> while checking recent commits to the kernel core i took a look at the
> PID namespaces implementation, and it has a fatal flaw: it breaks
> futexes and various libraries (and other stuff) that use PIDs as the
> means of identifying tasks, by not providing any means of global
> identification that works across PID namespaces. (PIDs _are_ a very
> convenient and global way of identifying contexts.)
>
> i asked Ulrich about this and it turns out he has warned about this
> early on:
>
> http://www.nabble.com/Re%3A-question%3A-pid-space-semantics.-p3409990.html
>
> but this problem is still present in the code, and it has been recently
> committed into mainline via:
>
> commit 30e49c263e36341b60b735cbef5ca37912549264
> Author: Pavel Emelyanov <[email protected]>
> Date: Thu Oct 18 23:40:10 2007 -0700
>
> pid namespaces: allow cloning of new namespace
>
> without these problems having been resolved. A full-scale revert is
> probably too intrusive, but at minimum we need to turn off user-space
> access to this feature via this simple patch. Until this issue is
> resolved properly the new PID namespace code needs to be turned off.
> Letting this into 2.6.24 would be a disaster.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> kernel/fork.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> Index: v/kernel/fork.c
> ===================================================================
> --- v.orig/kernel/fork.c
> +++ v/kernel/fork.c
> @@ -1420,6 +1420,14 @@ long do_fork(unsigned long clone_flags,
> int trace = 0;
> long nr;
>
> + /*
> + * PID namespaces are broken at the moment: they do not allow
> + * certain PID based syscalls (such as futexes) to be used
> + * across namespaces. This is broken and must not be allowed,
> + * so we keep this feature turned off until it's properly fixed.
> + */
> + clone_flags &= ~CLONE_NEWPID;
> +

Well, emm. Eric already tried to solve this issue in the similar way
(http://lkml.org/lkml/2007/10/26/414), but I have recently sent a
more generic patch set. It turns all the namespaces off with the
config options, but Andrew said to wait until the next -mm tree to
rework the set.

With this set we'll be able to mark pid namespaces as EXPERIMENTAL
or even BROKEN, so nobody will be able to crate them. So can we, please,
keep things as they are for now - the appropriate fix will be ready
soon.

Thanks,
Pavel

> if (unlikely(current->ptrace)) {
> trace = fork_traceflag (clone_flags);
> if (trace)
>

2007-11-01 14:53:59

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ingo Molnar wrote:
> + clone_flags &= ~CLONE_NEWPID;

I think the call should rather fail than silently drop the bit but aside
from that I agree. The problems we'd run into if the feature is getting
used as-is are severe.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFHKehh2ijCOnn/RHQRAqHAAJkBu7Uj8T5J2ZlLty096zXH7IVcwACfRhlt
EpwnZ1UodJXJiPpxGN8FEYo=
=S/kB
-----END PGP SIGNATURE-----

2007-11-01 14:56:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

On Thu, 2007-11-01 at 17:51 +0300, Pavel Emelyanov wrote:

> So can we, please,
> keep things as they are for now - the appropriate fix will be ready
> soon.

Just for the curious, could you outline on how you intend to fix this?

2007-11-01 14:57:28

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Pavel Emelyanov wrote:
> With this set we'll be able to mark pid namespaces as EXPERIMENTAL
> or even BROKEN, so nobody will be able to crate them. So can we, please,
> keep things as they are for now - the appropriate fix will be ready
> soon.

You sound far too optimistic for my taste. I probably haven't seen the
proposal you have in mind but everything else I have seen simply doesn't
work without breaking something.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFHKek22ijCOnn/RHQRAgJIAJ0VYwYHUKdEcnKfZHDdaUr5HTnk9QCgghTH
n57LDahLDIIVIIlkrwNVLLQ=
=2xU8
-----END PGP SIGNATURE-----

2007-11-01 15:02:28

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

Ingo Molnar wrote:
> while checking recent commits to the kernel core i took a look at the
> PID namespaces implementation, and it has a fatal flaw: it breaks
> futexes and various libraries (and other stuff) that use PIDs as the
> means of identifying tasks, by not providing any means of global
> identification that works across PID namespaces. (PIDs _are_ a very

You're not 100% correct here. The task_pid_nr() does return you a
unique pid, so you do have the way to identify the task.

Another thing - you should *not* allow tasks to communicate across
pid namespaces using any pids - this just breaks the pid namespaces
idea.

As far as the futexes are concerned - I do not allow threads live
in different pid namespaces (more correct fix would be not to allow
tasks share the mm_struct across pid namespaces, but this is a one
line fix), so the situation when you have two threads in different
namespaces is impossible.

Thanks,
Pavel

> convenient and global way of identifying contexts.)
>
> i asked Ulrich about this and it turns out he has warned about this
> early on:
>
> http://www.nabble.com/Re%3A-question%3A-pid-space-semantics.-p3409990.html
>
> but this problem is still present in the code, and it has been recently
> committed into mainline via:
>
> commit 30e49c263e36341b60b735cbef5ca37912549264
> Author: Pavel Emelyanov <[email protected]>
> Date: Thu Oct 18 23:40:10 2007 -0700
>
> pid namespaces: allow cloning of new namespace
>
> without these problems having been resolved. A full-scale revert is
> probably too intrusive, but at minimum we need to turn off user-space
> access to this feature via this simple patch. Until this issue is
> resolved properly the new PID namespace code needs to be turned off.
> Letting this into 2.6.24 would be a disaster.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> kernel/fork.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> Index: v/kernel/fork.c
> ===================================================================
> --- v.orig/kernel/fork.c
> +++ v/kernel/fork.c
> @@ -1420,6 +1420,14 @@ long do_fork(unsigned long clone_flags,
> int trace = 0;
> long nr;
>
> + /*
> + * PID namespaces are broken at the moment: they do not allow
> + * certain PID based syscalls (such as futexes) to be used
> + * across namespaces. This is broken and must not be allowed,
> + * so we keep this feature turned off until it's properly fixed.
> + */
> + clone_flags &= ~CLONE_NEWPID;
> +
> if (unlikely(current->ptrace)) {
> trace = fork_traceflag (clone_flags);
> if (trace)
>

2007-11-01 15:05:54

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

Ulrich Drepper wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Pavel Emelyanov wrote:
>> With this set we'll be able to mark pid namespaces as EXPERIMENTAL
>> or even BROKEN, so nobody will be able to crate them. So can we, please,
>> keep things as they are for now - the appropriate fix will be ready
>> soon.
>
> You sound far too optimistic for my taste. I probably haven't seen the
> proposal you have in mind but everything else I have seen simply doesn't
> work without breaking something.

The "fix" I mention is just returning -EINVAL in case user orders
CLONE_NEWPIDS and compiling out all the namespace cloning code. This
is just a more elegant way to get rid of pid namespaces rather than
Ingo proposed.

Here's the root of the set:
http://lkml.org/lkml/2007/10/31/118

Thanks,
Pavel

2007-11-01 15:06:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround


* Ulrich Drepper <[email protected]> wrote:

> Ingo Molnar wrote:
> > + clone_flags &= ~CLONE_NEWPID;
>
> I think the call should rather fail than silently drop the bit but
> aside from that I agree. The problems we'd run into if the feature is
> getting used as-is are severe.

does the patch below look OK to you?

Ingo

------------------->
From: Ingo Molnar <[email protected]>
Subject: PID namespaces: turn them off for now

while checking recent commits to the kernel core i took a look at the
PID namespaces implementation, and it has a fatal flaw: it breaks
futexes and various libraries (and other stuff) that use PIDs as the
means of identifying tasks, by not providing any means of global
identification that works across PID namespaces. (PIDs _are_ a very
convenient and global way of identifying contexts.)

i asked Ulrich about this and it turns out he has warned about this
early on:

http://www.nabble.com/Re%3A-question%3A-pid-space-semantics.-p3409990.html

but this problem is still present in the code, and it has been recently
committed into mainline via:

commit 30e49c263e36341b60b735cbef5ca37912549264
Author: Pavel Emelyanov <[email protected]>
Date: Thu Oct 18 23:40:10 2007 -0700

pid namespaces: allow cloning of new namespace

without these problems having been resolved. A full-scale revert is
probably too intrusive, but at minimum we need to turn off user-space
access to this feature via this simple patch. Until this issue is
resolved properly the new PID namespace code needs to be turned off.
Letting this into 2.6.24 would be a disaster.

Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/fork.c | 9 +++++++++
1 file changed, 9 insertions(+)

Index: v/kernel/fork.c
===================================================================
--- v.orig/kernel/fork.c
+++ v/kernel/fork.c
@@ -1420,6 +1420,15 @@ long do_fork(unsigned long clone_flags,
int trace = 0;
long nr;

+ /*
+ * PID namespaces are broken at the moment: they do not allow
+ * certain PID based syscalls (such as futexes) to be used
+ * across namespaces. This is broken and must not be allowed,
+ * so we keep this feature turned off until it's properly fixed.
+ */
+ if (clone_flags & CLONE_NEWPID)
+ return -ENOSYS;
+
if (unlikely(current->ptrace)) {
trace = fork_traceflag (clone_flags);
if (trace)

2007-11-01 15:07:27

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

Peter Zijlstra wrote:
> On Thu, 2007-11-01 at 17:51 +0300, Pavel Emelyanov wrote:
>
>> So can we, please,
>> keep things as they are for now - the appropriate fix will be ready
>> soon.
>
> Just for the curious, could you outline on how you intend to fix this?

I have already answered to Ulrich about this. Just to
make this sub-thread consistent:

The "fix" I mention is just returning -EINVAL in case user orders
CLONE_NEWPIDS and compiling out all the namespace cloning code. This
is just a more elegant way to get rid of pid namespaces rather than
Ingo proposed.

Here's the root of the set:
http://lkml.org/lkml/2007/10/31/118

Thanks,
Pavel

2007-11-01 15:18:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround


* Pavel Emelyanov <[email protected]> wrote:

> The "fix" I mention is just returning -EINVAL in case user orders
> CLONE_NEWPIDS and compiling out all the namespace cloning code. This
> is just a more elegant way to get rid of pid namespaces rather than
> Ingo proposed.

unfortunately i have to NACK that approach. We never allowed broken
user-space visible APIs into the kernel like that because it just gives
a vector for that breakage to become de-facto used and forced upon the
core kernel. Even if they can be .config turned off. That's just a lame
excuse that delays the fixing of it. We may mark features that have a
good expectation to be fixed as CONFIG_EXPERIMENTAL, and we may mark
drivers that nobody maintains anymore as CONFIG_BROKEN, but we dont
introduce new core syscall features with CONFIG_BROKEN! We never did and
i hope we never will.

The _only_ way to force the fixing of such type of breakages is to not
offer them _at all_. Really, you are proposing a major new extension to
lots of important core Linux APIs so please try to solve this problem
cleanly, it's really severe. Right now as things stand this containers
sub-feature is "a little bit pregnant". This is one of the few cases
where we really _must_ say no.

Ingo

2007-11-01 15:31:18

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

Ingo Molnar wrote:
> * Pavel Emelyanov <[email protected]> wrote:
>
>> The "fix" I mention is just returning -EINVAL in case user orders
>> CLONE_NEWPIDS and compiling out all the namespace cloning code. This
>> is just a more elegant way to get rid of pid namespaces rather than
>> Ingo proposed.
>
> unfortunately i have to NACK that approach. We never allowed broken
> user-space visible APIs into the kernel like that because it just gives
> a vector for that breakage to become de-facto used and forced upon the
> core kernel. Even if they can be .config turned off. That's just a lame
> excuse that delays the fixing of it. We may mark features that have a
> good expectation to be fixed as CONFIG_EXPERIMENTAL, and we may mark

Pid namespaces have more than a good expectations to be fixed, so
feel free to mark the (currently pending) PID_NS config option with
"depends on EXPERIMENTAL".

All the problems I know are slowly getting fixed, but most of them are
just related to "bad pid value is reported to the user space when we
work inside some new namespace".

Unfortunately, as I can see, all the discussions of pid namespaces
happen behind my bask, so all I can is just fix the problems I'm
aware of.

> drivers that nobody maintains anymore as CONFIG_BROKEN, but we dont
> introduce new core syscall features with CONFIG_BROKEN! We never did and
> i hope we never will.
>
> The _only_ way to force the fixing of such type of breakages is to not
> offer them _at all_. Really, you are proposing a major new extension to
> lots of important core Linux APIs so please try to solve this problem
> cleanly, it's really severe. Right now as things stand this containers

I'm sure, that no *new* problems appear in case you don't enter the new
namespace, so just disable the pid space creation code (with EXPERIMENTAL
option) and live with the original kernel.

If you point me one, I'd be glad to fix it.

> sub-feature is "a little bit pregnant". This is one of the few cases
> where we really _must_ say no.
>
> Ingo
>

2007-11-01 16:12:50

by Dave Hansen

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

On Thu, 2007-11-01 at 07:56 -0700, Ulrich Drepper wrote:
> Pavel Emelyanov wrote:
> > With this set we'll be able to mark pid namespaces as EXPERIMENTAL
> > or even BROKEN, so nobody will be able to crate them. So can we, please,
> > keep things as they are for now - the appropriate fix will be ready
> > soon.
>
> You sound far too optimistic for my taste. I probably haven't seen the
> proposal you have in mind but everything else I have seen simply doesn't
> work without breaking something.

Yeah, we definitely realize that this inhibits things that were
perfectly fine before.

As Eric mentioned in his reply to your message last year, the primary
goal here is isolation. We'd eventually like to be able to pick a
container up and move it to another system. That's going to be awfully
hard if the container is sharing a resource with a part of the system
which is not moving.

Pid namespaces (along with the others) give us the isolation to keep
these interactions from happening except in a controlled manner,
breaking the ties that might bind it to one particular system.

Think of how many user-visible apis deal with files and filenames.
However, there can certainly be files that are unavailable to certain
processes based on their membership in a particular filesystem
namespaces. In fact, we use chroot() to try and _make_ certain files
unavailable.

-- Dave

2007-11-01 18:58:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

On Thu, Nov 01, 2007 at 04:05:37PM +0100, Ingo Molnar wrote:
> + if (clone_flags & CLONE_NEWPID)
> + return -ENOSYS;

I'd use EINVAL instead of ENOSYS...

- Ted

2007-11-01 19:55:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround


* Theodore Tso <[email protected]> wrote:

> On Thu, Nov 01, 2007 at 04:05:37PM +0100, Ingo Molnar wrote:
> > + if (clone_flags & CLONE_NEWPID)
> > + return -ENOSYS;
>
> I'd use EINVAL instead of ENOSYS...

ok, updated patch below.

Ingo

---------------->
From: Ingo Molnar <[email protected]>
Subject: PID namespaces: turn them off for now

while checking recent commits to the kernel core i took a look at the
PID namespaces implementation, and it has a fatal flaw: it breaks
futexes and various libraries (and other stuff) that use PIDs as the
means of identifying tasks, by not providing any means of global
identification that works across PID namespaces. (PIDs _are_ a very
convenient and global way of identifying contexts.)

i asked Ulrich about this and it turns out he has warned about this
early on:

http://www.nabble.com/Re%3A-question%3A-pid-space-semantics.-p3409990.html

but this problem is still present in the code, and it has been recently
committed into mainline via:

commit 30e49c263e36341b60b735cbef5ca37912549264
Author: Pavel Emelyanov <[email protected]>
Date: Thu Oct 18 23:40:10 2007 -0700

pid namespaces: allow cloning of new namespace

without these problems having been resolved. A full-scale revert is
probably too intrusive, but at minimum we need to turn off user-space
access to this feature via this simple patch. Until this issue is
resolved properly the new PID namespace code needs to be turned off.
Letting this into 2.6.24 would be a disaster.

Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/fork.c | 9 +++++++++
1 file changed, 9 insertions(+)

Index: v/kernel/fork.c
===================================================================
--- v.orig/kernel/fork.c
+++ v/kernel/fork.c
@@ -1420,6 +1420,15 @@ long do_fork(unsigned long clone_flags,
int trace = 0;
long nr;

+ /*
+ * PID namespaces are broken at the moment: they do not allow
+ * certain PID based syscalls (such as futexes) to be used
+ * across namespaces. This is broken and must not be allowed,
+ * so we keep this feature turned off until it's properly fixed.
+ */
+ if (clone_flags & CLONE_NEWPID)
+ return -EINVAL;
+
if (unlikely(current->ptrace)) {
trace = fork_traceflag (clone_flags);
if (trace)

2007-11-02 00:22:18

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Pavel Emelyanov wrote:
> The "fix" I mention is just returning -EINVAL in case user orders
> CLONE_NEWPIDS

That is the "fix" you were referring to? I was hoping you have a sketch
for a real solution. If nobody can think of a way to fix this PID
namespaces are IMO not something which should go in at all.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFHKm2R2ijCOnn/RHQRAgjXAKCkU9lcWC9aTR0nG89x47AZO9pVfwCgiaVC
/Giyp+en+VbtfFyD8D6v4Xk=
=RnIw
-----END PGP SIGNATURE-----

2007-11-02 00:23:44

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ingo Molnar wrote:

> but this problem is still present in the code, and it has been recently
> committed into mainline via:
>
> commit 30e49c263e36341b60b735cbef5ca37912549264
> Author: Pavel Emelyanov <[email protected]>
> Date: Thu Oct 18 23:40:10 2007 -0700
>
> pid namespaces: allow cloning of new namespace
>
> without these problems having been resolved. A full-scale revert is
> probably too intrusive, but at minimum we need to turn off user-space
> access to this feature via this simple patch. Until this issue is
> resolved properly the new PID namespace code needs to be turned off.
> Letting this into 2.6.24 would be a disaster.
>
> Signed-off-by: Ingo Molnar <[email protected]>

Acked-by: Ulrich Drepper <[email protected]>


- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFHKm3n2ijCOnn/RHQRAn7dAJ9PhfhLg29mTELwH7qLXwgJcyNi9QCgr7sc
WQa4QBNesktzPKh5vcCulhM=
=cYnF
-----END PGP SIGNATURE-----

2007-11-02 07:55:42

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

Ulrich Drepper wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Pavel Emelyanov wrote:
>> The "fix" I mention is just returning -EINVAL in case user orders
>> CLONE_NEWPIDS
>
> That is the "fix" you were referring to? I was hoping you have a sketch
> for a real solution. If nobody can think of a way to fix this PID

Looks like we misunderstood each other. Can you please elaborate on
what exactly is broken in pid namespaces?

> namespaces are IMO not something which should go in at all.

We (mainly me and Sukadev) are fixing the issues we are aware of, but
looks like somebody found something and didn't notify the authors
about it.

Thanks,
Pavel

> - --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.7 (GNU/Linux)
>
> iD8DBQFHKm2R2ijCOnn/RHQRAgjXAKCkU9lcWC9aTR0nG89x47AZO9pVfwCgiaVC
> /Giyp+en+VbtfFyD8D6v4Xk=
> =RnIw
> -----END PGP SIGNATURE-----
>

2007-11-02 08:05:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

On Fri, 02 Nov 2007 10:55:02 +0300 Pavel Emelyanov <[email protected]> wrote:

> Ulrich Drepper wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > Pavel Emelyanov wrote:
> >> The "fix" I mention is just returning -EINVAL in case user orders
> >> CLONE_NEWPIDS
> >
> > That is the "fix" you were referring to? I was hoping you have a sketch
> > for a real solution. If nobody can think of a way to fix this PID
>
> Looks like we misunderstood each other. Can you please elaborate on
> what exactly is broken in pid namespaces?

Isn't it this?

http://lkml.org/lkml/2007/11/1/141

2007-11-02 08:15:26

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

Andrew Morton wrote:
> On Fri, 02 Nov 2007 10:55:02 +0300 Pavel Emelyanov <[email protected]> wrote:
>
>> Ulrich Drepper wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> Pavel Emelyanov wrote:
>>>> The "fix" I mention is just returning -EINVAL in case user orders
>>>> CLONE_NEWPIDS
>>> That is the "fix" you were referring to? I was hoping you have a sketch
>>> for a real solution. If nobody can think of a way to fix this PID
>> Looks like we misunderstood each other. Can you please elaborate on
>> what exactly is broken in pid namespaces?
>
> Isn't it this?
>
> http://lkml.org/lkml/2007/11/1/141

That was the initial problem, and I already answered to Ingo about
it - pid, obtained in one pid namespace shouldn't be used in another.
This is not a design bug, but a design idea. If he managed to get two
threads in different namespaces, then we should fix this ability (but
I thought that I handled it - the copy_pid_ns call doesn't allow to
create a new thread in a new namespace:

new_ns = ERR_PTR(-EINVAL);
if (flags & CLONE_THREAD)
goto out_put;

) I should have first asked Ingo about how he managed to get two
threads in different namespaces to fix this, but Ulrich said that

"everything else I have seen simply doesn't work without
breaking something"

so I asked him to elaborate on this - what _else_ doesn't work.

Thanks,
Pavel

2007-11-02 14:06:31

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Pavel Emelyanov wrote:
>> Isn't it this?
>>
>> http://lkml.org/lkml/2007/11/1/141
>
> That was the initial problem, and I already answered to Ingo about
> it

No, look at my old mail which Ingo referenced in that posting.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFHKy692ijCOnn/RHQRAtYLAJ98EXTGl3HMlCbVXOkL7TJRFfw4DACfcgYI
HHz5f7TfM05Dps+ruPRiUrU=
=IjS4
-----END PGP SIGNATURE-----

2007-11-02 14:35:56

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

Ulrich Drepper wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Pavel Emelyanov wrote:
>>> Isn't it this?
>>>
>>> http://lkml.org/lkml/2007/11/1/141
>> That was the initial problem, and I already answered to Ingo about
>> it
>
> No, look at my old mail which Ingo referenced in that posting.

You pointed only one problem that is not a variation of "how do
we handle the case when we pass our pid outside the namespace".

This problem with signals is now being resolved at IBM by Sukadev
and Serge (I put them in Cc), so this is about to be fixed by the
time 2.6.24 releases (I hope).

As far as the "passing the pid outside the namespace" is concerned,
is my answer "pids should never be used outside the namespace they
came from, otherwise userspace won't work as expected" satisfactory?

So is "everything else", you mentioned, covered with the problems
above?

> - --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.7 (GNU/Linux)
>
> iD8DBQFHKy692ijCOnn/RHQRAtYLAJ98EXTGl3HMlCbVXOkL7TJRFfw4DACfcgYI
> HHz5f7TfM05Dps+ruPRiUrU=
> =IjS4
> -----END PGP SIGNATURE-----
>

2007-11-02 15:35:21

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Pavel Emelyanov wrote:
> So is "everything else", you mentioned, covered with the problems
> above?

No, it's not. If you'd read the mail carefully you'd notice that the
use of PIDs especially in robust futexes is part of the API and that it
simply isn't acceptable to say "don't do that". A robust mutex can be
stored in any file and as long as two processes have access to the same
file (or they can pass each other shared memory) the underlying futex
functionality simply must work.

This whole approach to allow switching on and off each of the namespaces
is just wrong. Do it all or nothing, at least for the problematic ones
like NEWPID. Having access to the same filesystem but using separate
PID namespaces is simply not going to work.

You also brush completely over the SysV IPC issue.

And I doubt that I spent enough time thinking about all this to arrive
at the more subtle problems. I don't think especially the PID namespace
is ready at all at this time.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD4DBQFHK0N42ijCOnn/RHQRAkPyAJiDR9ZEPUbCdEa2xk+Te80B7avDAJ4mgy7v
jgtZG129yBUGBrpQ8fbn7w==
=ho0Z
-----END PGP SIGNATURE-----

2007-11-02 16:15:09

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

Ulrich Drepper wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Pavel Emelyanov wrote:
>> So is "everything else", you mentioned, covered with the problems
>> above?
>
> No, it's not. If you'd read the mail carefully you'd notice that the
> use of PIDs especially in robust futexes is part of the API and that it
> simply isn't acceptable to say "don't do that". A robust mutex can be
> stored in any file and as long as two processes have access to the same
> file (or they can pass each other shared memory) the underlying futex
> functionality simply must work.

This is the case when you export the pid to the user level outside
the namespace. This case is not supposed to work at all. I know it
and there's noting we can do with it. (some more comments about this
below)

> This whole approach to allow switching on and off each of the namespaces
> is just wrong. Do it all or nothing, at least for the problematic ones
> like NEWPID. Having access to the same filesystem but using separate
> PID namespaces is simply not going to work.

I'd like to note, that the original reason to switch the namespace off
was to help embedded people get rid of the functionality they don't
need and save the vmlinux size. Since Ingo proposed to disable the
namespace creation in a ... strange way, I noticed, that there will be
a more elegant way to do this. This was not the "fix" for cross-namespaces
communications.

Nevertheless...

Having access to the same IPCs in different pid namespaces won't work.
Having access to the same filesystem in different IPC namespaces won't work.
Having access to the same UID namespace in different VFS namespaces won't work.
Having access to the same <any> namespace in different <many others> namespace
wont' work.

That's the idea OpenVZ tried to promote when the story with "containers"
started, but most of the other participants decided that we can create
individual namespaces and step-by-step try to make them work in all the
possible combinations.

Right now we have a pid namespace, which

a) works fine in the initial namespace (by this I mean that it doesn't
introduce *new* bugs);
b) mostly works in the sub namespace. some work is to be done and it
is being done;
c) doesn't work in some ways (but not at all) when tasks communicate
across the namespace boundary, but is not going to by definition.

I'm also looking for a good solution on how to workaround the
"c" case, but I'm not agree with the statement that "the pid
namespaces are completely broken". They are not completely broken,
but there is just some work to do with the case "b" and some way
to be invented to disable the case "c".

> You also brush completely over the SysV IPC issue.

I did not - this problem is only relevant when you try to setup the
IPC communication between processes from different namespaces, but
I have already answered this question.

If you use IPC within a single namespaces everything works just fine.

> And I doubt that I spent enough time thinking about all this to arrive
> at the more subtle problems. I don't think especially the PID namespace
> is ready at all at this time.
>
> - --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.7 (GNU/Linux)
>
> iD4DBQFHK0N42ijCOnn/RHQRAkPyAJiDR9ZEPUbCdEa2xk+Te80B7avDAJ4mgy7v
> jgtZG129yBUGBrpQ8fbn7w==
> =ho0Z
> -----END PGP SIGNATURE-----
>

2007-11-02 17:30:49

by Dave Hansen

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

On Fri, 2007-11-02 at 01:04 -0700, Andrew Morton wrote:
> > > That is the "fix" you were referring to? I was hoping you have a sketch
> > > for a real solution. If nobody can think of a way to fix this PID
> >
> > Looks like we misunderstood each other. Can you please elaborate on
> > what exactly is broken in pid namespaces?
>
> Isn't it this?
>
> http://lkml.org/lkml/2007/11/1/141

I think we're still a bit murky on exactly what the issues are. Ingo,
Ulrich, is this the right track? The kind of issues that you're
concerned about?

There are certainly more of these, but here is one In the futex
userspace address, we install the current pid's vnr into a userspace
address.

static int futex_lock_pi(u32 __user *uaddr, struct rw_semaphore *fshared,
int detect, ktime_t *time, int trylock)
{
...
newval = task_pid_vnr(current);
curval = cmpxchg_futex_value_locked(uaddr, 0, newval);

We obviously don't have any restrictions on who else might be mapping
that address, so that pid can theoretically leak out to any other task.
In another pid namespace, the pid at that userspace address is certainly
nonsensical.

-- Dave

2007-11-02 17:46:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround



On Fri, 2 Nov 2007, Dave Hansen wrote:
>
> There are certainly more of these, but here is one In the futex
> userspace address, we install the current pid's vnr into a userspace
> address.

Now, realistically, why not just say "you can't use these things across
namespaces"? Does anybody really care? After all, somebody who screws this
up only screws himself, not anybody else.

Linus

2007-11-02 21:40:43

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

On Fri, Nov 02, 2007 at 06:58:47PM +0300, Pavel Emelyanov wrote:
> Having access to the same IPCs in different pid namespaces won't work.
> Having access to the same filesystem in different IPC namespaces won't work.
> Having access to the same UID namespace in different VFS namespaces won't work.
> Having access to the same <any> namespace in different <many others> namespace
> wont' work.
>
> That's the idea OpenVZ tried to promote when the story with "containers"
> started, but most of the other participants decided that we can create
> individual namespaces and step-by-step try to make them work in all the
> possible combinations.

Heh. Well, this won't be the first time that we go around the design
circle wiht people objecting with the idea eventually figuring out
that the original idea really was the only sane way to do things. :-)

Maybe it would be instructive to create a matrix which lists areas
where processes that share namespace FOO but not namespace BAR would
result in breakage, with an explanation of what breaks in a particular
instance? Assuming we continue to go down the path of orthogonal
namespace, having a file in Documentation/ which lists places where
there different namepsaces have dependencies on each other for correct
system call operation would be a Good Thing.

- Ted

2007-11-03 04:03:20

by Nicholas Miell

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

On Fri, 2007-11-02 at 10:39 -0700, Linus Torvalds wrote:
>
> On Fri, 2 Nov 2007, Dave Hansen wrote:
> >
> > There are certainly more of these, but here is one In the futex
> > userspace address, we install the current pid's vnr into a userspace
> > address.
>
> Now, realistically, why not just say "you can't use these things across
> namespaces"? Does anybody really care? After all, somebody who screws this
> up only screws himself, not anybody else.
>
> Linus

Accessing the same robust futex from different PID namespaces on the
same machine via a shared file mapping is logically equivalent to
accessing the same robust futex from different machines via a shared
filesystem and there's no reason to expect either operation to work
correctly.

--
Nicholas Miell <[email protected]>

2007-11-03 04:34:47

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Pavel Emelyanov wrote:
> Having access to the same IPCs in different pid namespaces won't work.
> Having access to the same filesystem in different IPC namespaces won't work.
> Having access to the same UID namespace in different VFS namespaces won't work.
> Having access to the same <any> namespace in different <many others> namespace
> wont' work.
> [...]


Then explicitly prevent the cases which cannot work in the clone()
calls. Yes, giving people rope to shoot themselves is a Unix tradition
but it's so unnecessary in this case and will only cause support
problems for innocent people.

I bet the result will be that if you have a separate PID namespace you
need to enforce every other namespace as well. There are simply too
many dependencies.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFHK/pL2ijCOnn/RHQRAtp6AKC8QIRvJa4qVUSx9IVpRq6X+6HPGQCff/hT
m2tpKWmeM+xAfS5ICvB0NVk=
=5ozn
-----END PGP SIGNATURE-----

2007-11-03 20:02:22

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

Pavel Emelianov [[email protected]] wrote:
| Ulrich Drepper wrote:
| > -----BEGIN PGP SIGNED MESSAGE-----
| > Hash: SHA1
| >
| > Pavel Emelyanov wrote:
| >>> Isn't it this?
| >>>
| >>> http://lkml.org/lkml/2007/11/1/141
| >> That was the initial problem, and I already answered to Ingo about
| >> it
| >
| > No, look at my old mail which Ingo referenced in that posting.
|
| You pointed only one problem that is not a variation of "how do
| we handle the case when we pass our pid outside the namespace".
|
| This problem with signals is now being resolved at IBM by Sukadev
| and Serge (I put them in Cc), so this is about to be fixed by the
| time 2.6.24 releases (I hope).

Yes. We (Oleg, Eric included in Cc) have a patchset to address signals
issues in child pid namespaces. It is being discussed on Containers list:

https://lists.linux-foundation.org/pipermail/containers/2007-October/008240.html

We will post the patchset to LKML soon.

|
| As far as the "passing the pid outside the namespace" is concerned,
| is my answer "pids should never be used outside the namespace they
| came from, otherwise userspace won't work as expected" satisfactory?
|
| So is "everything else", you mentioned, covered with the problems
| above?
|
| > - --
| > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
| > -----BEGIN PGP SIGNATURE-----
| > Version: GnuPG v1.4.7 (GNU/Linux)
| >
| > iD8DBQFHKy692ijCOnn/RHQRAtYLAJ98EXTGl3HMlCbVXOkL7TJRFfw4DACfcgYI
| > HHz5f7TfM05Dps+ruPRiUrU=
| > =IjS4
| > -----END PGP SIGNATURE-----
| >

2007-11-03 20:13:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround


* Linus Torvalds <[email protected]> wrote:

> On Fri, 2 Nov 2007, Dave Hansen wrote:
> >
> > There are certainly more of these, but here is one In the futex
> > userspace address, we install the current pid's vnr into a userspace
> > address.
>
> Now, realistically, why not just say "you can't use these things
> across namespaces"? Does anybody really care? After all, somebody who
> screws this up only screws himself, not anybody else.

i see two main categories of problems:

- one problem is that this condition is 'invisible'. If two namespaces
happen to access the same robust futex (say a yum update from two
PID namespaces sharing the same read-mostly filesystem) there's silent
breakage and data corruption due to PID overlap. The other
namespaces have no such problems. I think the "dont do that" answer is
lame because most apps _will_ work across PID namespaces because
things like fcntl based locking does work. And there's no valid
technical excuse why futexes shouldnt work: it's all controlled by the
same native kernel, there's no untrusted network separating the nodes,
etc.

- so via this we isolate an important category of syscalls from
cross-namespace use perhaps forever. Pick just about any other kernel
resource and they can be shared between namespaces. But not futexes -
which happen to be the most scalable locking primitive and people will
almost certainly want to use them across namespaces. A
completely new breed of futexes has to be introduced and trickled
through userspace and all the architectures to make it work again
across namespaces. Who will do that work? Generally the people who
introduce a new concept are the ones who should do that. But in this
case they are apparently not interested in making it generic enough
(they are concentrated on their 'isolate it all' aspect) so
nobody else will do and we are stuck with an incomplete concept.

The answer of user-space/apps is predictable: they'll gravitate towards
the path of least resistance, and that will be "dont use futexes". PID
namespaces basically single out an important API category and use the
natural pressure of the other 300 syscalls and tens of thousands of apps
against this category. Linux is basically used against itself. The
counter-force is relatively weak and there's no solution available _at
all_ presently so it's not even the fight of patches against each other,
it's the sheer lack of a feature which has an obvious end-result.

We've already got way too many incomplete concepts and APIs in the
kernel. Maybe i'm over-worrying, but i fear we end up like with
capabilities or sendfile - code merged too soon and never completed for
many years - perhaps never completed at all. VMS and WNT did those
things a bit better i think - their API frameworks were/are pervasive
and complete, even in the corner cases.

Whether it's the right approach to force reasonable perfection of
frameworks like this from the get go is another question - but in
practice even for relatively popular new APIs like epoll we see a way
too slow movement towards the 'completion of the API', and that hinders
adoption of new APIs very much. (With splice being a notable exception -
there the central concept was so strong that it quickly pushed itself to
total completion - combined with a capable maintainer of the API.) But
it's not that easy for futexes and we put another roadblock in the path
of futexes.

Ingo

2007-11-03 22:41:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround



On Sat, 3 Nov 2007, Ingo Molnar wrote:
>
> - one problem is that this condition is 'invisible'. If two namespaces
> happen to access the same robust futex (say a yum update from two
> PID namespaces sharing the same read-mostly filesystem) there's silent
> breakage and data corruption due to PID overlap.

.. and this is in *no* way different from thousands of applications that
write their pid to lock-files, and others decide that it's "stale" because
using "kill(pid, 0)" returns that the pid doesn't exist any more.

The solution? You can't do that kind of locking over NFS, or across pid
namespaces. Nobody blames NFS or pid namespaces for it.

> - so via this we isolate an important category of syscalls from
> cross-namespace use perhaps forever.

So? That's inherent to how those stupid stable mutexes work.

I don't understand how you can call this a "PID namespace design bug",
when it clearly has nothing what-so-ever to do with pid namespaces, and
everything to do with the *futexes* that blithely assume that pid's are
unique and that made it part of the user-visible interface.

OF COURSE any pid namespace design will always break such assumptions, but
that's not because of any PID namespace bugs. It's what the whole *point*
of PID namespaces are. If you use pid's (instead of some opaque cookies),
you will not be able to use such things across pid-separation.

Linus

2007-11-03 23:56:05

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

On Sat, 3 Nov 2007 15:40:48 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

> I don't understand how you can call this a "PID namespace design
> bug", when it clearly has nothing what-so-ever to do with pid
> namespaces, and everything to do with the *futexes* that blithely
> assume that pid's are unique and that made it part of the
> user-visible interface.
>
> OF COURSE any pid namespace design will always break such
> assumptions, but that's not because of any PID namespace bugs. It's
> what the whole *point* of PID namespaces are. If you use pid's
> (instead of some opaque cookies), you will not be able to use such
> things across pid-separation.

well... kind of.
THere are 2 things around pid namespaces: which pids you can see/touch
(in proc or signals or otherwise), and the non-uniqueness.

For containers you clearly want the first part... but... is there a
strong reason to not just *not* create duplicate pids even across
namespaces? there's no rule in posix or anything similar to fd's afaik
concerning which pids we can hand out... so we could just make then
unique globally but just with limited visibility....

2007-11-04 00:14:18

by David Lang

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

On Sat, 3 Nov 2007, Arjan van de Ven wrote:

> On Sat, 3 Nov 2007 15:40:48 -0700 (PDT)
> Linus Torvalds <[email protected]> wrote:
>
>> I don't understand how you can call this a "PID namespace design
>> bug", when it clearly has nothing what-so-ever to do with pid
>> namespaces, and everything to do with the *futexes* that blithely
>> assume that pid's are unique and that made it part of the
>> user-visible interface.
>>
>> OF COURSE any pid namespace design will always break such
>> assumptions, but that's not because of any PID namespace bugs. It's
>> what the whole *point* of PID namespaces are. If you use pid's
>> (instead of some opaque cookies), you will not be able to use such
>> things across pid-separation.
>
> well... kind of.
> THere are 2 things around pid namespaces: which pids you can see/touch
> (in proc or signals or otherwise), and the non-uniqueness.
>
> For containers you clearly want the first part... but... is there a
> strong reason to not just *not* create duplicate pids even across
> namespaces? there's no rule in posix or anything similar to fd's afaik
> concerning which pids we can hand out... so we could just make then
> unique globally but just with limited visibility....

two problems that I can think of

1. the container people would like to eventually have the ability to
migrate containers from one system to another (or to suspend a container)
in this sort of case trying to fit the allocated PIDs from the container
into a running system is a problem if PIDs are not allowed to overlap.

2. it seems to me that there is porobably a latent security issue in
having a global PID namespace with just limited visability. the types of
bugs that may let you affect a process seem easier to make if the only
protection is visability rather then complete seperation.

David Lang

2007-11-04 07:24:27

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

[email protected] writes:

> Pavel Emelianov [[email protected]] wrote:
> | Ulrich Drepper wrote:
> | > -----BEGIN PGP SIGNED MESSAGE-----
> | > Hash: SHA1
> | >
> | > Pavel Emelyanov wrote:
> | >>> Isn't it this?
> | >>>
> | >>> http://lkml.org/lkml/2007/11/1/141
> | >> That was the initial problem, and I already answered to Ingo about
> | >> it
> | >
> | > No, look at my old mail which Ingo referenced in that posting.
> |
> | You pointed only one problem that is not a variation of "how do
> | we handle the case when we pass our pid outside the namespace".
> |
> | This problem with signals is now being resolved at IBM by Sukadev
> | and Serge (I put them in Cc), so this is about to be fixed by the
> | time 2.6.24 releases (I hope).
>
> Yes. We (Oleg, Eric included in Cc) have a patchset to address signals
> issues in child pid namespaces. It is being discussed on Containers list:
>
> https://lists.linux-foundation.org/pipermail/containers/2007-October/008240.html
>
> We will post the patchset to LKML soon.

Yes. Getting all of the cross namespace cases working that we can is a
goal. Currently I don't know if we can do better with the futexes that
have pids in the user/kernel ABI. Plain futexes should be fine.

Implementation wise si_pid is a bit of a pain but we should have that
one sorted out shortly. It is a well understood and we just need to get
the code right.

The pids in the sysvipc space should also be fairly simple to handle
just do a classic struct pid conversion. And convert from struct pid
to a pid_t right at the user/kernel interface.

Unless I missed something we already properly handle giving people
usable pids from the tty layer.

Getting pids working properly for unix domain socket credential when
we cross pid namespaces is another case that needs a struct pid
conversion to get things working, but the kernel should be able to
do the right thing at that point.

In summary when pids are stored inside the kernel we have all of the
needed infrastructure with struct pid to handle doing the right
thing processes communicate between pid namespaces.

Right now we just need to go through every place in the kernel make
certain we haven't over looked something we can handle. And there
are a lot of places where the kernel uses pids....

Eric

2007-11-04 10:39:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] PID namespaces


(changed the Subject line)

* Linus Torvalds <[email protected]> wrote:

> On Sat, 3 Nov 2007, Ingo Molnar wrote:
> >
> > - one problem is that this condition is 'invisible'. If two
> > namespaces happen to access the same robust futex (say a yum
> > update from two PID namespaces sharing the same read-mostly
> > filesystem) there's silent breakage and data corruption due to PID
> > overlap.
>
> .. and this is in *no* way different from thousands of applications
> that write their pid to lock-files, and others decide that it's
> "stale" because using "kill(pid, 0)" returns that the pid doesn't
> exist any more.
>
> The solution? You can't do that kind of locking over NFS, or across
> pid namespaces. Nobody blames NFS or pid namespaces for it.

the difference to NFS is that for PID namespaces we do have a single
trusted kernel that fully controls all the domains so there's no obvious
"hard barrier of trust" that people could perceive as a showstopper.

We've got a global kernel and unlike other namespaces there's (almost)
no "directed allocation" done of specific PIDs (unlike files, socket
addresses or fds). So the PID is a cookie that is 99.9% shaped _by the
kernel already_. [there are a few exceptions but those are much less
problematic than the lack of global PIDs is] So we might as well shape
the cookies in a way that keeps them global. What is the technological
reason for not keeping PIDs globally unique? We've cited a good number
of reasons why it's desirable - it's a pretty damn useful cookie for
identifying tasks. (it's also very scalable - PID -> task lookup is
completely lockless.)

I.e. keep the namespace functionality but use a modulo 1.000.000 base
for the PIDs so that it all looks nicer to the user. Minimal visibility
difference but maximum compatibility. (The resulting limits are
reasonable: 1 million tasks per container and 4 million containers on a
single 32-bit box.) We could still restrict cross-namespace API use but
all the cases where a global PID is desirable would still all work. I
might be missing something obvious though.

The reason why i bring this up now is because 2.6.24 is an
all-or-nothing flag day for this detail. Once it's out there we wont
realistically be able to change any of these details. (And in general
i'm very supportive of the containers concept - a year ago at the KS i
was one of the very few proponents of quickly merging containers into
the kernel.)

Ingo

2007-11-04 20:12:38

by Dave Hansen

[permalink] [raw]
Subject: Re: [patch] PID namespaces

On Sun, 2007-11-04 at 11:38 +0100, Ingo Molnar wrote:
> I.e. keep the namespace functionality but use a modulo 1.000.000 base
> for the PIDs so that it all looks nicer to the user. Minimal visibility
> difference but maximum compatibility. (The resulting limits are
> reasonable: 1 million tasks per container and 4 million containers on a
> single 32-bit box.) We could still restrict cross-namespace API use but
> all the cases where a global PID is desirable would still all work. I
> might be missing something obvious though.

There is definitely a great deal of desire to have containers look as
much as possible like a normally functioning system. That includes
having an init process. Everything today depends on that init process
having a pretty specific pid. That's definitely one of the 0.1% of
things that isn't really shaped by the kernel, but it's a pretty
important one 0.1%. (Linux Vserver does this pid virtualization, but
_only_ for init, btw.)

We also need to consider the needs of a checkpoint/restart system. Most
of my interest in containers comes because of their isolation
properties. That isolation is what lets us pick a container up and move
it more easily across systems.

But, once we've moved the container, all of that "single, global kernel"
stuff goes out the window because it wasn't just one kernel making
decisions. Plus, those pids stop becoming just cookies that were issued
by one kernel and interpreted by one kernel.

-- Dave

2007-11-05 14:47:21

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [patch] PID namespaces

On Sunday 04 November 2007 10:38, Ingo Molnar wrote:
> I.e. keep the namespace functionality but use a modulo 1.000.000 base
> for the PIDs so that it all looks nicer to the user. Minimal visibility
> difference but maximum compatibility. (The resulting limits are
> reasonable: 1 million tasks per container and 4 million containers on a
> single 32-bit box.) We could still restrict cross-namespace API use but

4 thousand, not 4 million.
--
vda

2007-11-06 08:27:41

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [patch] PID namespace design bug, workaround

Ulrich Drepper wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Pavel Emelyanov wrote:
>> Having access to the same IPCs in different pid namespaces won't work.
>> Having access to the same filesystem in different IPC namespaces won't work.
>> Having access to the same UID namespace in different VFS namespaces won't work.
>> Having access to the same <any> namespace in different <many others> namespace
>> wont' work.
>> [...]
>
>
> Then explicitly prevent the cases which cannot work in the clone()
> calls. Yes, giving people rope to shoot themselves is a Unix tradition
> but it's so unnecessary in this case and will only cause support
> problems for innocent people.

:)

> I bet the result will be that if you have a separate PID namespace you
> need to enforce every other namespace as well. There are simply too
> many dependencies.

I think, that Ted's proposal (about the "namespaces compatibility matrix") is
better. I'd prefer knowing of what can stop working in case I do something
rather that forcedly having my hands off this.

Thanks,
Pavel

2007-11-20 22:55:55

by Eric W. Biederman

[permalink] [raw]
Subject: Futexes and network filesystems.

Ingo Molnar <[email protected]> writes:

> * Linus Torvalds <[email protected]> wrote:
>
>> On Fri, 2 Nov 2007, Dave Hansen wrote:
>> >
>> > There are certainly more of these, but here is one In the futex
>> > userspace address, we install the current pid's vnr into a userspace
>> > address.
>>
>> Now, realistically, why not just say "you can't use these things
>> across namespaces"? Does anybody really care? After all, somebody who
>> screws this up only screws himself, not anybody else.
>
> i see two main categories of problems:
>
> - one problem is that this condition is 'invisible'.
>
> - so via this we isolate an important category of syscalls from
> cross-namespace use perhaps forever.

I had a chance to think about this a bit more, and realized that
the problem is that futexes don't appear to work on network
filesystems, even if the network filesystems provide coherent shared
memory.

It seems to me that we need to have a call that gets a unique token
for a process for each filesystem per filesystem for use in futexes
(especially robust futexes). Say get_fs_task_id(const char *path);

On local filesystems this could just be the pid as we use today, but
for filesystems that can be accessed from contexts with potentially
overlapping pid values this could be something else. It is an extra
syscall in the preparation path, but it should be hardly more
expensive the current getpid().

Once we have fixed the futex infrastructure to be able to handle
futexes on network filesystems, the pid namespace case will be trivial
to implement.

Eric




2007-11-21 06:17:53

by Kyle Moffett

[permalink] [raw]
Subject: Re: Futexes and network filesystems.

On Nov 20, 2007, at 17:53:52, Er ic W. Biederman wrote:
> I had a chance to think about this a bit more, and realized that
> the problem is that futexes don't appear to work on network
> filesystems, even if the network filesystems provide coherent
> shared memory.
>
> It seems to me that we need to have a call that gets a unique token
> for a process for each filesystem per filesystem for use in futexes
> (especially robust futexes). Say get_fs_task_id(const char *path);
>
> On local filesystems this could just be the pid as we use today,
> but for filesystems that can be accessed from contexts with
> potentially overlapping pid values this could be something else.
> It is an extra syscall in the preparation path, but it should be
> hardly more expensive the current getpid().
>
> Once we have fixed the futex infrastructure to be able to handle
> futexes on network filesystems, the pid namespace case will be
> trivial to implement.

Actually, I would think that get_vm_task_id(void *addr) would be a
more useful interface. The call would still be a relatively simple
lookup to find the struct file associated with the particular virtual
mapping, but it would be race-free from the perspective of userspace
and would not require that we somehow figure out the file descriptor
associated with a particular mmap() (which may be closed by this
point in time). Useful extension would be the get_fd_task_id(int fd)
and get_fs_task_id(const char *path), but those are less important.

The other important thing is to ensure that somehow the numbers are
considered unique only within the particular domain of a container,
such that you can migrate a container from one system to another even
using a simple local ext3 filesystem (on a networked block device)
and still be able to have things work properly even after the
migration. Naturally this would only work with an upgraded libc but
I think that's a reasonable requirement to enforce for migration of
futexes and cross-network futexes.

Even for network filesystems which don't implement coherent shared
memory, you might add a memexcl() system call which (when used by
multiple cooperating processes) ensures that a given page is only
ever mapped by at most one computer accessing a given network
filesystem. The page-outs and page-ins when shuttling that page
across the network would be expensive, but I believe the cost would
be reasonable for many applications and it would allow traditional
atomic ops on the mapped pages to take and release futexes in the
uncontended case.

Cheers,
Kyle Moffett

2007-11-21 06:42:18

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Futexes and network filesystems.

Kyle Moffett <[email protected]> writes:

> On Nov 20, 2007, at 17:53:52, Er ic W. Biederman wrote:
>> I had a chance to think about this a bit more, and realized that the problem
>> is that futexes don't appear to work on network filesystems, even if the
>> network filesystems provide coherent shared memory.
>>
>> It seems to me that we need to have a call that gets a unique token for a
>> process for each filesystem per filesystem for use in futexes (especially
>> robust futexes). Say get_fs_task_id(const char *path);
>>
>> On local filesystems this could just be the pid as we use today, but for
>> filesystems that can be accessed from contexts with potentially overlapping
>> pid values this could be something else. It is an extra syscall in the
>> preparation path, but it should be hardly more expensive the current getpid().
>>
>> Once we have fixed the futex infrastructure to be able to handle futexes on
>> network filesystems, the pid namespace case will be trivial to implement.
>
> Actually, I would think that get_vm_task_id(void *addr) would be a more useful
> interface. The call would still be a relatively simple lookup to find the
> struct file associated with the particular virtual mapping, but it would be
> race-free from the perspective of userspace and would not require that we
> somehow figure out the file descriptor associated with a particular mmap()
> (which may be closed by this point in time). Useful extension would be the
> get_fd_task_id(int fd) and get_fs_task_id(const char *path), but those are less
> important.

You are probably right. The important thing is that we get an interface where
user space can cache the result. Or else you totally loose the benefit of
avoiding trapping into the kernel.

> The other important thing is to ensure that somehow the numbers are considered
> unique only within the particular domain of a container, such that you can
> migrate a container from one system to another even using a simple local ext3
> filesystem (on a networked block device) and still be able to have things work
> properly even after the migration. Naturally this would only work with an
> upgraded libc but I think that's a reasonable requirement to enforce for
> migration of futexes and cross-network futexes.

Well. The numbers are unique per filesystem. Which is what should save you.
The numbers can't simply be unique per container.

> Even for network filesystems which don't implement coherent shared memory, you
> might add a memexcl() system call which (when used by multiple cooperating
> processes) ensures that a given page is only ever mapped by at most one
> computer accessing a given network filesystem. The page-outs and page-ins when
> shuttling that page across the network would be expensive, but I believe the
> cost would be reasonable for many applications and it would allow traditional
> atomic ops on the mapped pages to take and release futexes in the uncontended
> case.

Sounds reasonable to me.

Eric