2023-09-29 21:47:11

by Brian Geffon

[permalink] [raw]
Subject: [PATCH] pid: Allow frozen userspace to reboot from non-init pid ns

When the system has a frozen userspace, for example, during hibernation
the child reaper task will also be frozen. Attmepting to deliver a
signal to it to handle the reboot(2) will ultimately lead to the system
hanging unless userspace is thawed.

This change checks if the current task is the suspending task and if so
it will allow it to proceed with a reboot from the non-init pid ns.

Signed-off-by: Brian Geffon <[email protected]>
Reported-by: Matthias Kaehlcke <[email protected]>
Tested-by: Matthias Kaehlcke <[email protected]>
---
kernel/pid_namespace.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0bf44afe04dd..4a93a5063eda 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -321,6 +321,15 @@ int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
if (pid_ns == &init_pid_ns)
return 0;

+ if (current->flags & PF_SUSPEND_TASK) {
+ /*
+ * Attempting to signal the child_reaper won't work if it's
+ * frozen. In this case we shutdown the system as if we were in
+ * the init_pid_ns.
+ */
+ return 0;
+ }
+
switch (cmd) {
case LINUX_REBOOT_CMD_RESTART2:
case LINUX_REBOOT_CMD_RESTART:
--
2.42.0.582.g8ccd20d70d-goog


2023-09-29 21:56:29

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH] pid: Allow frozen userspace to reboot from non-init pid ns

On Fri, Sep 29, 2023 at 01:44:42PM -0400, Brian Geffon wrote:
> When the system has a frozen userspace, for example, during hibernation
> the child reaper task will also be frozen. Attmepting to deliver a
> signal to it to handle the reboot(2) will ultimately lead to the system
> hanging unless userspace is thawed.
>
> This change checks if the current task is the suspending task and if so
> it will allow it to proceed with a reboot from the non-init pid ns.

I don't know the code flow too well here, but shouldn't init_pid_ns
always be doing the reboot regardless of anything else?

Also how is this syscall running if current is frozen? This feels weird
to me... shouldn't the frozen test be against pid_ns->child_reaper
instead of current?

-Kees

--
Kees Cook

2023-09-30 10:33:23

by Brian Geffon

[permalink] [raw]
Subject: Re: [PATCH] pid: Allow frozen userspace to reboot from non-init pid ns

On Fri, Sep 29, 2023 at 4:09 PM Kees Cook <[email protected]> wrote:
>
> On Fri, Sep 29, 2023 at 01:44:42PM -0400, Brian Geffon wrote:
> > When the system has a frozen userspace, for example, during hibernation
> > the child reaper task will also be frozen. Attmepting to deliver a
> > signal to it to handle the reboot(2) will ultimately lead to the system
> > hanging unless userspace is thawed.
> >
> > This change checks if the current task is the suspending task and if so
> > it will allow it to proceed with a reboot from the non-init pid ns.
>
> I don't know the code flow too well here, but shouldn't init_pid_ns
> always be doing the reboot regardless of anything else?

I think the point of this is, normally the reaper is runnable and so
an appropriate signal will be delivered allowing them to also clean up
[2]. In our case, they won't be runnable and doing this wouldn't make
sense.

>
> Also how is this syscall running if current is frozen? This feels weird
> to me... shouldn't the frozen test be against pid_ns->child_reaper
> instead of current?

The task which froze the system won't be frozen to make sure this
happens it will have the flag PF_SUSPEND_TASK added, so we know if we
have this flag we're the only running user space task [1].

>
> -Kees

I hope my understanding is correct and it makes sense. Thanks for
taking the time to review.

Brian

1. https://elixir.bootlin.com/linux/latest/source/kernel/power/process.c#L130
2. https://elixir.bootlin.com/linux/latest/source/kernel/pid_namespace.c#L327


>
> --
> Kees Cook

2023-10-09 20:06:10

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH] pid: Allow frozen userspace to reboot from non-init pid ns

On Fri, Sep 29, 2023 at 08:25:42PM -0400, Brian Geffon wrote:
> On Fri, Sep 29, 2023 at 4:09 PM Kees Cook <[email protected]> wrote:
> >
> > On Fri, Sep 29, 2023 at 01:44:42PM -0400, Brian Geffon wrote:
> > > When the system has a frozen userspace, for example, during hibernation
> > > the child reaper task will also be frozen. Attmepting to deliver a
> > > signal to it to handle the reboot(2) will ultimately lead to the system
> > > hanging unless userspace is thawed.
> > >
> > > This change checks if the current task is the suspending task and if so
> > > it will allow it to proceed with a reboot from the non-init pid ns.
> >
> > I don't know the code flow too well here, but shouldn't init_pid_ns
> > always be doing the reboot regardless of anything else?
>
> I think the point of this is, normally the reaper is runnable and so
> an appropriate signal will be delivered allowing them to also clean up
> [2]. In our case, they won't be runnable and doing this wouldn't make
> sense.
>
> >
> > Also how is this syscall running if current is frozen? This feels weird
> > to me... shouldn't the frozen test be against pid_ns->child_reaper
> > instead of current?
>
> The task which froze the system won't be frozen to make sure this
> happens it will have the flag PF_SUSPEND_TASK added, so we know if we
> have this flag we're the only running user space task [1].
>
> >
> > -Kees
>
> I hope my understanding is correct and it makes sense. Thanks for
> taking the time to review.

Christian, is this something you can take a look at? I'm much less
familiar with this area of the process list logic.

-Kees

--
Kees Cook

2023-10-12 03:58:39

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] pid: Allow frozen userspace to reboot from non-init pid ns

Brian Geffon <[email protected]> writes:

> On Fri, Sep 29, 2023 at 4:09 PM Kees Cook <[email protected]> wrote:
>>
>> On Fri, Sep 29, 2023 at 01:44:42PM -0400, Brian Geffon wrote:
>> > When the system has a frozen userspace, for example, during hibernation
>> > the child reaper task will also be frozen. Attmepting to deliver a
>> > signal to it to handle the reboot(2) will ultimately lead to the system
>> > hanging unless userspace is thawed.
>> >
>> > This change checks if the current task is the suspending task and if so
>> > it will allow it to proceed with a reboot from the non-init pid ns.
>>
>> I don't know the code flow too well here, but shouldn't init_pid_ns
>> always be doing the reboot regardless of anything else?
>
> I think the point of this is, normally the reaper is runnable and so
> an appropriate signal will be delivered allowing them to also clean up
> [2]. In our case, they won't be runnable and doing this wouldn't make
> sense.

The entire reboot_pid_ns thing is just a polite way of keeping
applications like /sbin/reboot working inside a pid namespace.

Ordinarily the process calling reboot (inside the container) won't
have the privileges to request an entire system reboot. So I don't
see anything making sense to promote that reboot into a system-wide
reboot.

Which leads me to the question. What is actually happening with
hibernation that we want something inside a pid namespace to somehow
have the permissions to reboot the entire machine?

>> Also how is this syscall running if current is frozen? This feels weird
>> to me... shouldn't the frozen test be against pid_ns->child_reaper
>> instead of current?
>
> The task which froze the system won't be frozen to make sure this
> happens it will have the flag PF_SUSPEND_TASK added, so we know if we
> have this flag we're the only running user space task [1].

Someone has a task inside a container that is successfully suspending
the entire system?

I don't see how that makes sense.

But on the level that it somehow does I would put a test in
kernel/reboot.c something like:

/*
* If the caller can't perform a normal reboot call
* reboot_pid_ns
*/
if ((pid_ns != &init_pid_ns) &&
!((current->flags & PF_SUSPEND_TASK) && capable(CAP_SYS_BOOT))) {
return reboot_pid_ns(pid_ns, cmd);
}

Making reboot_pid_ns responsible for the logic that should be bypassing
it is quite confusing.

> I hope my understanding is correct and it makes sense. Thanks for
> taking the time to review.
>
> Brian
>
> 1. https://elixir.bootlin.com/linux/latest/source/kernel/power/process.c#L130
> 2. https://elixir.bootlin.com/linux/latest/source/kernel/pid_namespace.c#L327


I really don't know if allowing PF_SUSPEND_TASK so that hibernation and
the like can work from inside a container makes any sense at all.

But the above is roughly how I would make it work.

Eric

2023-10-12 09:48:47

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] pid: Allow frozen userspace to reboot from non-init pid ns

On Fri, Sep 29, 2023 at 7:45 PM Brian Geffon <[email protected]> wrote:
>
> When the system has a frozen userspace, for example, during hibernation
> the child reaper task will also be frozen. Attmepting to deliver a
> signal to it to handle the reboot(2) will ultimately lead to the system
> hanging unless userspace is thawed.
>
> This change checks if the current task is the suspending task and if so
> it will allow it to proceed with a reboot from the non-init pid ns.
>
> Signed-off-by: Brian Geffon <[email protected]>
> Reported-by: Matthias Kaehlcke <[email protected]>
> Tested-by: Matthias Kaehlcke <[email protected]>

If the report is public, which I think is the case, having a Link: tag
pointing to it here would be nice.

> ---
> kernel/pid_namespace.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 0bf44afe04dd..4a93a5063eda 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -321,6 +321,15 @@ int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
> if (pid_ns == &init_pid_ns)
> return 0;
>
> + if (current->flags & PF_SUSPEND_TASK) {
> + /*
> + * Attempting to signal the child_reaper won't work if it's
> + * frozen. In this case we shutdown the system as if we were in
> + * the init_pid_ns.
> + */

Is the system guaranteed to be in the right state for a shutdown at this point?

There is a system-wide suspend-resume or hibernation in progress, so
system_transition_mutex should be held and that should cause reboot()
to block anyway. Do you know why it doesn't block and why the suspend
task has any reason to call it?

> + return 0;
> + }
> +
> switch (cmd) {
> case LINUX_REBOOT_CMD_RESTART2:
> case LINUX_REBOOT_CMD_RESTART:
> --
> 2.42.0.582.g8ccd20d70d-goog
>

2023-10-17 19:01:03

by Brian Geffon

[permalink] [raw]
Subject: Re: [PATCH] pid: Allow frozen userspace to reboot from non-init pid ns

On Thu, Oct 12, 2023 at 5:48 AM Rafael J. Wysocki <[email protected]> wrote:
>
> On Fri, Sep 29, 2023 at 7:45 PM Brian Geffon <[email protected]> wrote:
> >
> > When the system has a frozen userspace, for example, during hibernation
> > the child reaper task will also be frozen. Attmepting to deliver a
> > signal to it to handle the reboot(2) will ultimately lead to the system
> > hanging unless userspace is thawed.
> >
> > This change checks if the current task is the suspending task and if so
> > it will allow it to proceed with a reboot from the non-init pid ns.
> >
> > Signed-off-by: Brian Geffon <[email protected]>
> > Reported-by: Matthias Kaehlcke <[email protected]>
> > Tested-by: Matthias Kaehlcke <[email protected]>
>
> If the report is public, which I think is the case, having a Link: tag
> pointing to it here would be nice.
>
> > ---
> > kernel/pid_namespace.c | 9 +++++++++
> > 1 file changed, 9 insertions(+)
> >
> > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> > index 0bf44afe04dd..4a93a5063eda 100644
> > --- a/kernel/pid_namespace.c
> > +++ b/kernel/pid_namespace.c
> > @@ -321,6 +321,15 @@ int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
> > if (pid_ns == &init_pid_ns)
> > return 0;
> >
> > + if (current->flags & PF_SUSPEND_TASK) {
> > + /*
> > + * Attempting to signal the child_reaper won't work if it's
> > + * frozen. In this case we shutdown the system as if we were in
> > + * the init_pid_ns.
> > + */
>
> Is the system guaranteed to be in the right state for a shutdown at this point?
>
> There is a system-wide suspend-resume or hibernation in progress, so
> system_transition_mutex should be held and that should cause reboot()
> to block anyway. Do you know why it doesn't block and why the suspend
> task has any reason to call it?
>

Sorry for the delay in responding to these questions, I'm going to do
another pass through this code and respond with a more detailed
explanation in the next few days.

> > + return 0;
> > + }
> > +
> > switch (cmd) {
> > case LINUX_REBOOT_CMD_RESTART2:
> > case LINUX_REBOOT_CMD_RESTART:
> > --
> > 2.42.0.582.g8ccd20d70d-goog
> >