2012-02-03 00:10:22

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/1][V5] Add reboot_pid_ns to handle the reboot syscall

On Thu, 5 Jan 2012 10:06:50 +0100
Daniel Lezcano <[email protected]> wrote:

> In the case of a child pid namespace, rebooting the system does not
> really makes sense. When the pid namespace is used in conjunction
> with the other namespaces in order to create a linux container, the
> reboot syscall leads to some problems.
>
> A container can reboot the host. That can be fixed by dropping
> the sys_reboot capability but we are unable to correctly to poweroff/
> halt/reboot a container and the container stays stuck at the shutdown
> time with the container's init process waiting indefinitively.
>
> After several attempts, no solution from userspace was found to reliabily
> handle the shutdown from a container.
>
> This patch propose to make the init process of the child pid namespace to
> exit with a signal status set to : SIGINT if the child pid namespace called
> "halt/poweroff" and SIGHUP if the child pid namespace called "reboot".
> When the reboot syscall is called and we are not in the initial
> pid namespace, we kill the pid namespace for "HALT", "POWEROFF", "RESTART",
> and "RESTART2". Otherwise we return EINVAL.
>
> Returning EINVAL is also an easy way to check if this feature is supported
> by the kernel when invoking another 'reboot' option like CAD.
>
> By this way the parent process of the child pid namespace knows if
> it rebooted or not and can take the right decision.

Looks OK, although the comments need help. Is the below still true?

Do you think it would be feasible to put your testcase into
tools/testing/selftests? I'm thinking "no", because running the test
needs elevated permissions and might reboot the user's machine(!).


--- a/include/linux/pid_namespace.h~pidns-add-reboot_pid_ns-to-handle-the-reboot-syscall-fix
+++ a/include/linux/pid_namespace.h
@@ -32,7 +32,7 @@ struct pid_namespace {
#endif
gid_t pid_gid;
int hide_pid;
- int reboot;
+ int reboot; /* group exit code if this pidns was rebooted */
};

extern struct pid_namespace init_pid_ns;
--- a/kernel/sys.c~pidns-add-reboot_pid_ns-to-handle-the-reboot-syscall-fix
+++ a/kernel/sys.c
@@ -444,9 +444,10 @@ SYSCALL_DEFINE4(reboot, int, magic1, int
magic2 != LINUX_REBOOT_MAGIC2C))
return -EINVAL;

- /* In case the pid namespaces are enabled, the current task is in a
- * child pid_namespace and the command is handled by 'reboot_pid_ns',
- * this one will invoke 'do_exit'.
+ /*
+ * If pid namespaces are enabled and the current task is in a child
+ * pid_namespace, the command is handled by reboot_pid_ns() which will
+ * call do_exit().
*/
ret = reboot_pid_ns(task_active_pid_ns(current), cmd);
if (ret)

> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -32,6 +32,7 @@ struct pid_namespace {
> #endif
> gid_t pid_gid;
> int hide_pid;
> + int reboot;
> };

This was particuarly distressing. The field was poorly named and other
people forgotting to document their data structures doesn't mean that
we should continue to do this!


2012-02-03 08:59:42

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 1/1][V5] Add reboot_pid_ns to handle the reboot syscall

On 02/03/2012 01:10 AM, Andrew Morton wrote:
> On Thu, 5 Jan 2012 10:06:50 +0100
> Daniel Lezcano<[email protected]> wrote:
>
>> In the case of a child pid namespace, rebooting the system does not
>> really makes sense. When the pid namespace is used in conjunction
>> with the other namespaces in order to create a linux container, the
>> reboot syscall leads to some problems.
>>
>> A container can reboot the host. That can be fixed by dropping
>> the sys_reboot capability but we are unable to correctly to poweroff/
>> halt/reboot a container and the container stays stuck at the shutdown
>> time with the container's init process waiting indefinitively.
>>
>> After several attempts, no solution from userspace was found to reliabily
>> handle the shutdown from a container.
>>
>> This patch propose to make the init process of the child pid namespace to
>> exit with a signal status set to : SIGINT if the child pid namespace called
>> "halt/poweroff" and SIGHUP if the child pid namespace called "reboot".
>> When the reboot syscall is called and we are not in the initial
>> pid namespace, we kill the pid namespace for "HALT", "POWEROFF", "RESTART",
>> and "RESTART2". Otherwise we return EINVAL.
>>
>> Returning EINVAL is also an easy way to check if this feature is supported
>> by the kernel when invoking another 'reboot' option like CAD.
>>
>> By this way the parent process of the child pid namespace knows if
>> it rebooted or not and can take the right decision.
> Looks OK, although the comments need help. Is the below still true?

Yes, thanks for fixing this.

>
> Do you think it would be feasible to put your testcase into
> tools/testing/selftests? I'm thinking "no", because running the test
> needs elevated permissions and might reboot the user's machine(!).

Yes, right. I don't think the user will be happy with that.
Unfortunately, I don't see how to test this feature without falling into
a reboot on failure. On the other side, this very specific feature is
used in the container environment and if it fails that will be spotted
immediately and fixed. So I don't think that does make sense to add this
test in tools/testing/selftests.

[ ... ]

> gid_t pid_gid;
> int hide_pid;
> + int reboot;
> };
> This was particuarly distressing. The field was poorly named and other
> people forgotting to document their data structures doesn't mean that
> we should continue to do this!

Thanks again for adding the description. I will take care next time to
add a simple description when the field name is not self-explicit or
ambiguous.

-- Daniel

2012-02-03 15:47:27

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH 1/1][V5] Add reboot_pid_ns to handle the reboot syscall

Quoting Daniel Lezcano ([email protected]):
> On 02/03/2012 01:10 AM, Andrew Morton wrote:
> >On Thu, 5 Jan 2012 10:06:50 +0100
> >Daniel Lezcano<[email protected]> wrote:
> >
> >>In the case of a child pid namespace, rebooting the system does not
> >>really makes sense. When the pid namespace is used in conjunction
> >>with the other namespaces in order to create a linux container, the
> >>reboot syscall leads to some problems.
> >>
> >>A container can reboot the host. That can be fixed by dropping
> >>the sys_reboot capability but we are unable to correctly to poweroff/
> >>halt/reboot a container and the container stays stuck at the shutdown
> >>time with the container's init process waiting indefinitively.
> >>
> >>After several attempts, no solution from userspace was found to reliabily
> >>handle the shutdown from a container.
> >>
> >>This patch propose to make the init process of the child pid namespace to
> >>exit with a signal status set to : SIGINT if the child pid namespace called
> >>"halt/poweroff" and SIGHUP if the child pid namespace called "reboot".
> >>When the reboot syscall is called and we are not in the initial
> >>pid namespace, we kill the pid namespace for "HALT", "POWEROFF", "RESTART",
> >>and "RESTART2". Otherwise we return EINVAL.
> >>
> >>Returning EINVAL is also an easy way to check if this feature is supported
> >>by the kernel when invoking another 'reboot' option like CAD.
> >>
> >>By this way the parent process of the child pid namespace knows if
> >>it rebooted or not and can take the right decision.
> >Looks OK, although the comments need help. Is the below still true?
>
> Yes, thanks for fixing this.
>
> >
> >Do you think it would be feasible to put your testcase into
> >tools/testing/selftests? I'm thinking "no", because running the test
> >needs elevated permissions and might reboot the user's machine(!).
>
> Yes, right. I don't think the user will be happy with that.
> Unfortunately, I don't see how to test this feature without falling
> into a reboot on failure. On the other side, this very specific
> feature is used in the container environment and if it fails that
> will be spotted immediately and fixed. So I don't think that does
> make sense to add this test in tools/testing/selftests.

In fact I can add Daniel's testcase to my lxc testsuite to run
separately, apart from lxc. I haven't yet hooked it up into our
jenkins qa rig, but that's planned.

> [ ... ]
>
> > gid_t pid_gid;
> > int hide_pid;
> >+ int reboot;
> > };
> >This was particuarly distressing. The field was poorly named and other
> >people forgotting to document their data structures doesn't mean that
> >we should continue to do this!
>
> Thanks again for adding the description. I will take care next time
> to add a simple description when the field name is not self-explicit
> or ambiguous.
>
> -- Daniel