From: "Daniel P. Berrange" <[email protected]>
The following commit
commit cf3f89214ef6a33fad60856bc5ffd7bb2fc4709b
Author: Daniel Lezcano <[email protected]>
Date: Wed Mar 28 14:42:51 2012 -0700
pidns: add reboot_pid_ns() to handle the reboot syscall
introduced custom handling of the reboot() syscall when invoked
from a non-initial PID namespace. The intent was that a process
in a container can be allowed to keep CAP_SYS_BOOT and execute
reboot() to shutdown/reboot just their private container, rather
than the host.
Unfortunately the kexec_load() syscall also relies on the
CAP_SYS_BOOT capability. So by allowing a container to keep
this capability to safely invoke reboot(), they mistakenly
also gain the ability to use kexec_load(). The solution is
to make kexec_load() return -EPERM if invoked from a PID
namespace that is not the initial namespace
Signed-off-by: Daniel P. Berrange <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Oleg Nesterov <[email protected]>
---
kernel/kexec.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 0668d58..b152bde 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -947,6 +947,11 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
if (!capable(CAP_SYS_BOOT))
return -EPERM;
+ /* Processes in containers must not be allowed to load a new
+ * kernel, even if they have CAP_SYS_BOOT */
+ if (task_active_pid_ns(current) != &init_pid_ns)
+ return -EPERM;
+
/*
* Verify we have a legal set of flags
* This leaves us room for future extensions.
--
1.7.11.2
On Fri, Aug 3, 2012 at 12:53 PM, Daniel P. Berrange <[email protected]> wrote:
> From: "Daniel P. Berrange" <[email protected]>
>
> The following commit
>
> commit cf3f89214ef6a33fad60856bc5ffd7bb2fc4709b
> Author: Daniel Lezcano <[email protected]>
> Date: Wed Mar 28 14:42:51 2012 -0700
>
> pidns: add reboot_pid_ns() to handle the reboot syscall
>
> introduced custom handling of the reboot() syscall when invoked
> from a non-initial PID namespace. The intent was that a process
> in a container can be allowed to keep CAP_SYS_BOOT and execute
> reboot() to shutdown/reboot just their private container, rather
> than the host.
>
> Unfortunately the kexec_load() syscall also relies on the
> CAP_SYS_BOOT capability. So by allowing a container to keep
> this capability to safely invoke reboot(), they mistakenly
> also gain the ability to use kexec_load(). The solution is
> to make kexec_load() return -EPERM if invoked from a PID
> namespace that is not the initial namespace
>
> Signed-off-by: Daniel P. Berrange <[email protected]>
> Cc: Serge Hallyn <[email protected]>
> Cc: Daniel Lezcano <[email protected]>
> Cc: Michael Kerrisk <[email protected]>
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
Why is this commit not marked for -stable?
--
Thanks,
//richard
The solution is to use user namespaces and to only test ns_capable on the magic reboot path.
For the 3.7 timeframe that should be a realistic solution.
Eric
On Fri, Aug 03, 2012 at 05:45:40AM -0700, Eric W. Biederman wrote:
> The solution is to use user namespaces and to only test ns_capable on the magic reboot path.
>
> For the 3.7 timeframe that should be a realistic solution.
Hmm, that would imply that if LXC wants to allow reboot()/CAP_SYS_BOOT
they will be forced to use CLONE_NEWUSER. I was rather looking for a way
to allow the container to keep CAP_SYS_BOOT, without also mandating use
of user namespaces.
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
"Daniel P. Berrange" <[email protected]> wrote:
>On Fri, Aug 03, 2012 at 05:45:40AM -0700, Eric W. Biederman wrote:
>> The solution is to use user namespaces and to only test ns_capable on
>the magic reboot path.
>>
>> For the 3.7 timeframe that should be a realistic solution.
>
>Hmm, that would imply that if LXC wants to allow reboot()/CAP_SYS_BOOT
>they will be forced to use CLONE_NEWUSER. I was rather looking for a
>way
>to allow the container to keep CAP_SYS_BOOT, without also mandating use
>of user namespaces.
If we remove the use of CAP_SYS_BOOT on the container reboot path perhaps.
But you have hit one small issue in the huge pile of issues why giving contaners capabilities is generally a bad idea.
This is the reason I have been insisting on a reasonable version of user namespaces for a long time.
When the security issues become important it is time for user namespaces. That is their purpose.
Eric
Eric,
during the container reboot discussion, the agreement was reached that rebooting for real fron non-init pid ns is not safe. Restarting userspace (in pidns caller owns) is. I argue the same reasoning supports this.
I haven't had a chance to review the patch, but the idea gets my ack. I'll look at the patch asap.
I'm also fine with splitting cap_sys_boot into a user and system caps. The former would only be needed targeted to the userns of the init pid, while the latter would be required to init_user_ns. Then containers could safely be given cap_sys_restart or whatever, but not cap_sys_boot which authorizes kexec and machine reset/poweroff.
----- Original message -----
> "Daniel P. Berrange" <[email protected]> wrote:
>
> > On Fri, Aug 03, 2012 at 05:45:40AM -0700, Eric W. Biederman wrote:
> > > The solution is to use user namespaces and to only test ns_capable on
> > the magic reboot path.
> > >
> > > For the 3.7 timeframe that should be a realistic solution.
> >
> > Hmm, that would imply that if LXC wants to allow reboot()/CAP_SYS_BOOT
> > they will be forced to use CLONE_NEWUSER. I was rather looking for a
> > way
> > to allow the container to keep CAP_SYS_BOOT, without also mandating use
> > of user namespaces.
>
> If we remove the use of CAP_SYS_BOOT on the container reboot path
> perhaps.
>
> But you have hit one small issue in the huge pile of issues why giving
> contaners capabilities is generally a bad idea.
>
> This is the reason I have been insisting on a reasonable version of user
> namespaces for a long time.
>
> When the security issues become important it is time for user
> namespaces. That is their purpose.
>
> Eric
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> in the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
Quoting Daniel P. Berrange ([email protected]):
> From: "Daniel P. Berrange" <[email protected]>
>
> The following commit
>
> commit cf3f89214ef6a33fad60856bc5ffd7bb2fc4709b
> Author: Daniel Lezcano <[email protected]>
> Date: Wed Mar 28 14:42:51 2012 -0700
>
> pidns: add reboot_pid_ns() to handle the reboot syscall
>
> introduced custom handling of the reboot() syscall when invoked
> from a non-initial PID namespace. The intent was that a process
> in a container can be allowed to keep CAP_SYS_BOOT and execute
> reboot() to shutdown/reboot just their private container, rather
> than the host.
>
> Unfortunately the kexec_load() syscall also relies on the
> CAP_SYS_BOOT capability. So by allowing a container to keep
> this capability to safely invoke reboot(), they mistakenly
> also gain the ability to use kexec_load(). The solution is
> to make kexec_load() return -EPERM if invoked from a PID
> namespace that is not the initial namespace
>
> Signed-off-by: Daniel P. Berrange <[email protected]>
> Cc: Serge Hallyn <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
(Please see my previous email explaining why I believe the pidns
is an appropriate check)
> Cc: Daniel Lezcano <[email protected]>
> Cc: Michael Kerrisk <[email protected]>
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> ---
> kernel/kexec.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index 0668d58..b152bde 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -947,6 +947,11 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
> if (!capable(CAP_SYS_BOOT))
> return -EPERM;
>
> + /* Processes in containers must not be allowed to load a new
> + * kernel, even if they have CAP_SYS_BOOT */
> + if (task_active_pid_ns(current) != &init_pid_ns)
> + return -EPERM;
> +
> /*
> * Verify we have a legal set of flags
> * This leaves us room for future extensions.
> --
> 1.7.11.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
"Serge E. Hallyn" <[email protected]> writes:
> Quoting Daniel P. Berrange ([email protected]):
>> From: "Daniel P. Berrange" <[email protected]>
>>
>> The following commit
>>
>> commit cf3f89214ef6a33fad60856bc5ffd7bb2fc4709b
>> Author: Daniel Lezcano <[email protected]>
>> Date: Wed Mar 28 14:42:51 2012 -0700
>>
>> pidns: add reboot_pid_ns() to handle the reboot syscall
>>
>> introduced custom handling of the reboot() syscall when invoked
>> from a non-initial PID namespace. The intent was that a process
>> in a container can be allowed to keep CAP_SYS_BOOT and execute
>> reboot() to shutdown/reboot just their private container, rather
>> than the host.
>>
>> Unfortunately the kexec_load() syscall also relies on the
>> CAP_SYS_BOOT capability. So by allowing a container to keep
>> this capability to safely invoke reboot(), they mistakenly
>> also gain the ability to use kexec_load(). The solution is
>> to make kexec_load() return -EPERM if invoked from a PID
>> namespace that is not the initial namespace
>>
>> Signed-off-by: Daniel P. Berrange <[email protected]>
>> Cc: Serge Hallyn <[email protected]>
>
> Acked-by: Serge Hallyn <[email protected]>
>
> (Please see my previous email explaining why I believe the pidns
> is an appropriate check)
Serge as to your objects.
If we define kexec_load in terms of the pid namespace then something
makes sense, but the error should be EINVAL, or something of that sort.
That is what we did with reboot. We defined reboot in terms of the pid
namespace.
Not defining kexec_load in terms of the pid namespace and then returning
EPERM because having we happen to have CAP_SYS_BOOT for other reasons is
semantically horrible.
At the end of the day the effect is the same, but I think it matters a
great deal in how we think about things.
We have CAP_SYS_BOOT in the initial user namespace. We do have
permission to make the system call.
So I continue to see this patch the way it is current constructed as
broken.
Nacked-by: "Eric W. Biederman" <[email protected]>
Eric
Quoting Serge Hallyn ([email protected]):
> Eric,
>
> during the container reboot discussion, the agreement was reached that rebooting for real fron non-init pid ns is not safe. Restarting userspace (in pidns caller owns) is. I argue the same reasoning supports this.
>
> I haven't had a chance to review the patch, but the idea gets my ack. I'll look at the patch asap.
>
> I'm also fine with splitting cap_sys_boot into a user and system caps. The former would only be needed targeted to the userns of the init pid, while the latter would be required to init_user_ns. Then containers could safely be given cap_sys_restart or whatever, but not cap_sys_boot which authorizes kexec and machine reset/poweroff.
Splitting the cap up into CAP_RESTART (restart /sbin/init) and CAP_BOOT
(reboot hardware or kexec kernel) has the advantage that the capabilities
each remain simpler to parse, no 'in this context it means that'.
Quoting Eric W. Biederman ([email protected]):
> "Serge E. Hallyn" <[email protected]> writes:
>
> > Quoting Daniel P. Berrange ([email protected]):
> >> From: "Daniel P. Berrange" <[email protected]>
> >>
> >> The following commit
> >>
> >> commit cf3f89214ef6a33fad60856bc5ffd7bb2fc4709b
> >> Author: Daniel Lezcano <[email protected]>
> >> Date: Wed Mar 28 14:42:51 2012 -0700
> >>
> >> pidns: add reboot_pid_ns() to handle the reboot syscall
> >>
> >> introduced custom handling of the reboot() syscall when invoked
> >> from a non-initial PID namespace. The intent was that a process
> >> in a container can be allowed to keep CAP_SYS_BOOT and execute
> >> reboot() to shutdown/reboot just their private container, rather
> >> than the host.
> >>
> >> Unfortunately the kexec_load() syscall also relies on the
> >> CAP_SYS_BOOT capability. So by allowing a container to keep
> >> this capability to safely invoke reboot(), they mistakenly
> >> also gain the ability to use kexec_load(). The solution is
> >> to make kexec_load() return -EPERM if invoked from a PID
> >> namespace that is not the initial namespace
> >>
> >> Signed-off-by: Daniel P. Berrange <[email protected]>
> >> Cc: Serge Hallyn <[email protected]>
> >
> > Acked-by: Serge Hallyn <[email protected]>
> >
> > (Please see my previous email explaining why I believe the pidns
> > is an appropriate check)
>
> Serge as to your objects.
>
> If we define kexec_load in terms of the pid namespace then something
> makes sense, but the error should be EINVAL, or something of that sort.
Makes sense.
> That is what we did with reboot. We defined reboot in terms of the pid
> namespace.
>
> Not defining kexec_load in terms of the pid namespace and then returning
> EPERM because having we happen to have CAP_SYS_BOOT for other reasons is
> semantically horrible.
>
> At the end of the day the effect is the same, but I think it matters a
> great deal in how we think about things.
>
> We have CAP_SYS_BOOT in the initial user namespace. We do have
> permission to make the system call.
>
> So I continue to see this patch the way it is current constructed as
> broken.
>
> Nacked-by: "Eric W. Biederman" <[email protected]>
I do also prefer splitting the capability. Michael Kerrisk, do you
have any good suggestions for better names than CAP_RESTART (for
killing or restarting /sbin/init) and CAP_BOOT (for kexec and/or
hardware resets)? Maybe CAP_RESTART_USER and CAP_RESTART_HW?
(CAP_SYS_BOOT being an alias for both for backward compatibility)