Hi,
This fixes the signal refactoring to actually kill unkillable processes
when receiving a fatal SIGSYS from seccomp. Thanks to Robert for the
report and Eric for the fix! I've also tweaked seccomp internal a bit to
fail more safely. This was a partial seccomp bypass, in the sense that
SECCOMP_RET_KILL_* didn't kill the process, but it didn't bypass other
aspects of the filters. (i.e. the syscall was still blocked, etc.)
I'll be sending this to Linus after a bit more testing...
Thanks,
-Kees
Kees Cook (3):
signal: HANDLER_EXIT should clear SIGNAL_UNKILLABLE
seccomp: Invalidate seccomp mode to catch death failures
samples/seccomp: Adjust sample to also provide kill option
kernel/seccomp.c | 10 ++++++++++
kernel/signal.c | 5 +++--
samples/seccomp/dropper.c | 9 +++++++--
3 files changed, 20 insertions(+), 4 deletions(-)
--
2.30.2
Fatal SIGSYS signals were not being delivered to pid namespace init
processes. Make sure the SIGNAL_UNKILLABLE doesn't get set for these
cases.
Reported-by: Robert Święcki <[email protected]>
Suggested-by: "Eric W. Biederman" <[email protected]>
Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Cc: [email protected]
Signed-off-by: Kees Cook <[email protected]>
---
kernel/signal.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index 38602738866e..33e3ee4f3383 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1342,9 +1342,10 @@ force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t,
}
/*
* Don't clear SIGNAL_UNKILLABLE for traced tasks, users won't expect
- * debugging to leave init killable.
+ * debugging to leave init killable, unless it is intended to exit.
*/
- if (action->sa.sa_handler == SIG_DFL && !t->ptrace)
+ if (action->sa.sa_handler == SIG_DFL &&
+ (!t->ptrace || (handler == HANDLER_EXIT)))
t->signal->flags &= ~SIGNAL_UNKILLABLE;
ret = send_signal(sig, info, t, PIDTYPE_PID);
spin_unlock_irqrestore(&t->sighand->siglock, flags);
--
2.30.2
On Thu, Feb 10, 2022 at 6:37 PM Kees Cook <[email protected]> wrote:
> On Thu, Feb 10, 2022 at 05:18:39PM +0100, Jann Horn wrote:
> > On Thu, Feb 10, 2022 at 3:53 AM Kees Cook <[email protected]> wrote:
> > > Fatal SIGSYS signals were not being delivered to pid namespace init
> > > processes. Make sure the SIGNAL_UNKILLABLE doesn't get set for these
> > > cases.
> > >
> > > Reported-by: Robert Święcki <[email protected]>
> > > Suggested-by: "Eric W. Biederman" <[email protected]>
> > > Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
> > > Cc: [email protected]
> > > Signed-off-by: Kees Cook <[email protected]>
> > > ---
> > > kernel/signal.c | 5 +++--
> > > 1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/kernel/signal.c b/kernel/signal.c
> > > index 38602738866e..33e3ee4f3383 100644
> > > --- a/kernel/signal.c
> > > +++ b/kernel/signal.c
> > > @@ -1342,9 +1342,10 @@ force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t,
> > > }
> > > /*
> > > * Don't clear SIGNAL_UNKILLABLE for traced tasks, users won't expect
> > > - * debugging to leave init killable.
> > > + * debugging to leave init killable, unless it is intended to exit.
> > > */
> > > - if (action->sa.sa_handler == SIG_DFL && !t->ptrace)
> > > + if (action->sa.sa_handler == SIG_DFL &&
> > > + (!t->ptrace || (handler == HANDLER_EXIT)))
> > > t->signal->flags &= ~SIGNAL_UNKILLABLE;
> >
> > You're changing the subclause:
> >
> > !t->ptrace
> >
> > to:
> >
> > (!t->ptrace || (handler == HANDLER_EXIT))
> >
> > which means that the change only affects cases where the process has a
> > ptracer, right? That's not the scenario the commit message is talking
> > about...
>
> Sorry, yes, I was not as accurate as I should have been in the commit
> log. I have changed it to:
>
> Fatal SIGSYS signals (i.e. seccomp RET_KILL_* syscall filter actions)
> were not being delivered to ptraced pid namespace init processes. Make
> sure the SIGNAL_UNKILLABLE doesn't get set for these cases.
So basically force_sig_info() is trying to figure out whether
get_signal() will later on check for SIGNAL_UNKILLABLE (the SIG_DFL
case), and if so, it clears the flag from the target's signal_struct
that marks the process as unkillable?
This used to be:
if (action->sa.sa_handler == SIG_DFL)
t->signal->flags &= ~SIGNAL_UNKILLABLE;
Then someone noticed that in the ptrace case, the signal might not
actually end up being consumed by the target process, and added the
"&& !t->ptrace" clause in commit
eb61b5911bdc923875cde99eb25203a0e2b06d43.
And now Robert Swiecki noticed that that still didn't accurately model
what'll happen in get_signal().
This seems hacky to me, and also racy: What if, while you're going
through a SECCOMP_RET_KILL_PROCESS in an unkillable process, some
other thread e.g. concurrently changes the disposition of SIGSYS from
a custom handler to SIG_DFL?
Instead of trying to figure out whether the signal would have been
fatal without SIGNAL_UNKILLABLE, I think it would be better to find a
way to tell the signal-handling code that SIGNAL_UNKILLABLE should be
bypassed for this specific signal, or something along those lines...
but of course that's also kind of messy because the signal-sending
code might fall back to just using the pending signal mask on
allocation failure IIRC?
Kees Cook <[email protected]> writes:
> On Thu, Feb 10, 2022 at 12:17:50PM -0600, Eric W. Biederman wrote:
>> Kees Cook <[email protected]> writes:
>>
>> > Hi,
>> >
>> > This fixes the signal refactoring to actually kill unkillable processes
>> > when receiving a fatal SIGSYS from seccomp. Thanks to Robert for the
>> > report and Eric for the fix! I've also tweaked seccomp internal a bit to
>> > fail more safely. This was a partial seccomp bypass, in the sense that
>> > SECCOMP_RET_KILL_* didn't kill the process, but it didn't bypass other
>> > aspects of the filters. (i.e. the syscall was still blocked, etc.)
>>
>> Any luck on figuring out how to suppress the extra event?
>
> I haven't found a good single indicator of a process being in an "I am dying"
> state, and even if I did, it seems every architecture's exit path would
> need to add a new test.
The "I am dying" state for a task is fatal_signal_pending, at least
before get_signal is reached, for a process there is SIGNAL_GROUP_EXIT.
Something I am busily cleaning up and making more reliable at the
moment.
What is the event that is happening? Is it
tracehook_report_syscall_exit or something else?
From the bits I have seen it seems like something else.
> The best approach seems to be clearing the TIF_*WORK* bits, but that's
> still a bit arch-specific. And I'm not sure which layer would do that.
> At what point have we decided the process will not continue? More
> than seccomp was calling do_exit() in the middle of a syscall, but those
> appear to have all been either SIGKILL or SIGSEGV?
This is where I get confused what TIF_WORK bits matter?
I expect if anything else mattered we would need to change it to
HANDLER_EXIT.
I made a mistake conflating to cases and I want to make certain I
successfully separate those two cases at the end of the day.
Eric
On Thu, Feb 10, 2022 at 05:18:39PM +0100, Jann Horn wrote:
> On Thu, Feb 10, 2022 at 3:53 AM Kees Cook <[email protected]> wrote:
> > Fatal SIGSYS signals were not being delivered to pid namespace init
> > processes. Make sure the SIGNAL_UNKILLABLE doesn't get set for these
> > cases.
> >
> > Reported-by: Robert Święcki <[email protected]>
> > Suggested-by: "Eric W. Biederman" <[email protected]>
> > Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
> > Cc: [email protected]
> > Signed-off-by: Kees Cook <[email protected]>
> > ---
> > kernel/signal.c | 5 +++--
> > 1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index 38602738866e..33e3ee4f3383 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -1342,9 +1342,10 @@ force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t,
> > }
> > /*
> > * Don't clear SIGNAL_UNKILLABLE for traced tasks, users won't expect
> > - * debugging to leave init killable.
> > + * debugging to leave init killable, unless it is intended to exit.
> > */
> > - if (action->sa.sa_handler == SIG_DFL && !t->ptrace)
> > + if (action->sa.sa_handler == SIG_DFL &&
> > + (!t->ptrace || (handler == HANDLER_EXIT)))
> > t->signal->flags &= ~SIGNAL_UNKILLABLE;
>
> You're changing the subclause:
>
> !t->ptrace
>
> to:
>
> (!t->ptrace || (handler == HANDLER_EXIT))
>
> which means that the change only affects cases where the process has a
> ptracer, right? That's not the scenario the commit message is talking
> about...
Sorry, yes, I was not as accurate as I should have been in the commit
log. I have changed it to:
Fatal SIGSYS signals (i.e. seccomp RET_KILL_* syscall filter actions)
were not being delivered to ptraced pid namespace init processes. Make
sure the SIGNAL_UNKILLABLE doesn't get set for these cases.
>
> > ret = send_signal(sig, info, t, PIDTYPE_PID);
> > spin_unlock_irqrestore(&t->sighand->siglock, flags);
> > --
> > 2.30.2
> >
--
Kees Cook
On Thu, Feb 10, 2022 at 12:17:50PM -0600, Eric W. Biederman wrote:
> Kees Cook <[email protected]> writes:
>
> > Hi,
> >
> > This fixes the signal refactoring to actually kill unkillable processes
> > when receiving a fatal SIGSYS from seccomp. Thanks to Robert for the
> > report and Eric for the fix! I've also tweaked seccomp internal a bit to
> > fail more safely. This was a partial seccomp bypass, in the sense that
> > SECCOMP_RET_KILL_* didn't kill the process, but it didn't bypass other
> > aspects of the filters. (i.e. the syscall was still blocked, etc.)
>
> Any luck on figuring out how to suppress the extra event?
I haven't found a good single indicator of a process being in an "I am dying"
state, and even if I did, it seems every architecture's exit path would
need to add a new test.
The best approach seems to be clearing the TIF_*WORK* bits, but that's
still a bit arch-specific. And I'm not sure which layer would do that.
At what point have we decided the process will not continue? More
than seccomp was calling do_exit() in the middle of a syscall, but those
appear to have all been either SIGKILL or SIGSEGV?
--
Kees Cook
On Thu, Feb 10, 2022 at 07:01:39PM +0100, Jann Horn wrote:
> On Thu, Feb 10, 2022 at 6:37 PM Kees Cook <[email protected]> wrote:
> > On Thu, Feb 10, 2022 at 05:18:39PM +0100, Jann Horn wrote:
> > > On Thu, Feb 10, 2022 at 3:53 AM Kees Cook <[email protected]> wrote:
> > > > Fatal SIGSYS signals were not being delivered to pid namespace init
> > > > processes. Make sure the SIGNAL_UNKILLABLE doesn't get set for these
> > > > cases.
> > > >
> > > > Reported-by: Robert Święcki <[email protected]>
> > > > Suggested-by: "Eric W. Biederman" <[email protected]>
> > > > Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
> > > > Cc: [email protected]
> > > > Signed-off-by: Kees Cook <[email protected]>
> > > > ---
> > > > kernel/signal.c | 5 +++--
> > > > 1 file changed, 3 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/kernel/signal.c b/kernel/signal.c
> > > > index 38602738866e..33e3ee4f3383 100644
> > > > --- a/kernel/signal.c
> > > > +++ b/kernel/signal.c
> > > > @@ -1342,9 +1342,10 @@ force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t,
> > > > }
> > > > /*
> > > > * Don't clear SIGNAL_UNKILLABLE for traced tasks, users won't expect
> > > > - * debugging to leave init killable.
> > > > + * debugging to leave init killable, unless it is intended to exit.
> > > > */
> > > > - if (action->sa.sa_handler == SIG_DFL && !t->ptrace)
> > > > + if (action->sa.sa_handler == SIG_DFL &&
> > > > + (!t->ptrace || (handler == HANDLER_EXIT)))
> > > > t->signal->flags &= ~SIGNAL_UNKILLABLE;
> > >
> > > You're changing the subclause:
> > >
> > > !t->ptrace
> > >
> > > to:
> > >
> > > (!t->ptrace || (handler == HANDLER_EXIT))
> > >
> > > which means that the change only affects cases where the process has a
> > > ptracer, right? That's not the scenario the commit message is talking
> > > about...
> >
> > Sorry, yes, I was not as accurate as I should have been in the commit
> > log. I have changed it to:
> >
> > Fatal SIGSYS signals (i.e. seccomp RET_KILL_* syscall filter actions)
> > were not being delivered to ptraced pid namespace init processes. Make
> > sure the SIGNAL_UNKILLABLE doesn't get set for these cases.
>
> So basically force_sig_info() is trying to figure out whether
> get_signal() will later on check for SIGNAL_UNKILLABLE (the SIG_DFL
> case), and if so, it clears the flag from the target's signal_struct
> that marks the process as unkillable?
>
> This used to be:
>
> if (action->sa.sa_handler == SIG_DFL)
> t->signal->flags &= ~SIGNAL_UNKILLABLE;
>
> Then someone noticed that in the ptrace case, the signal might not
> actually end up being consumed by the target process, and added the
> "&& !t->ptrace" clause in commit
> eb61b5911bdc923875cde99eb25203a0e2b06d43.
>
> And now Robert Swiecki noticed that that still didn't accurately model
> what'll happen in get_signal().
>
> This seems hacky to me, and also racy: What if, while you're going
> through a SECCOMP_RET_KILL_PROCESS in an unkillable process, some
> other thread e.g. concurrently changes the disposition of SIGSYS from
> a custom handler to SIG_DFL?
Do you mean after force_sig_info_to_task() has finished but before
get_signal()? SA_IMMUTABLE will block changes to the action.
If you mean before force_sig_info_to_task(), I don't see how that's
possible since it's under lock:
if (blocked || ignored || (handler != HANDLER_CURRENT)) {
action->sa.sa_handler = SIG_DFL;
if (handler == HANDLER_EXIT)
action->sa.sa_flags |= SA_IMMUTABLE;
...
if (action->sa.sa_handler == SIG_DFL &&
(!t->ptrace || (handler == HANDLER_EXIT)))
t->signal->flags &= ~SIGNAL_UNKILLABLE;
Given handler = HANDLER_EXIT, it'll always be SIG_DFL.
> Instead of trying to figure out whether the signal would have been
> fatal without SIGNAL_UNKILLABLE, I think it would be better to find a
> way to tell the signal-handling code that SIGNAL_UNKILLABLE should be
> bypassed for this specific signal, or something along those lines...
> but of course that's also kind of messy because the signal-sending
> code might fall back to just using the pending signal mask on
> allocation failure IIRC?
My original patch aimed that way:
diff --git a/kernel/signal.c b/kernel/signal.c
index 9b04631acde8..c124a09de6de 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2787,7 +2787,8 @@ bool get_signal(struct ksignal *ksig)
* case, the signal cannot be dropped.
*/
if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
- !sig_kernel_only(signr))
+ !sig_kernel_only(signr) &&
+ !(ka->sa.sa_flags & SA_IMMUTABLE))
continue;
if (sig_kernel_stop(signr)) {
But I don't think there's a race, and Eric's suggestion seemed
better in the sense that the state change is entirely contained by
force_sig_info_to_task().
-Kees
--
Kees Cook
Kees Cook <[email protected]> writes:
> Fatal SIGSYS signals were not being delivered to pid namespace init
> processes . Make sure the SIGNAL_UNKILLABLE doesn't get set for these
^ when ptraced.
> cases.
Reviewed-by: "Eric W. Biederman" <[email protected]>
I have pointed out a few nits in the wording but otherwise this looks good.
>
> Reported-by: Robert Święcki <[email protected]>
> Suggested-by: "Eric W. Biederman" <[email protected]>
> Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
> Cc: [email protected]
> Signed-off-by: Kees Cook <[email protected]>
> ---
> kernel/signal.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 38602738866e..33e3ee4f3383 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1342,9 +1342,10 @@ force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t,
> }
> /*
> * Don't clear SIGNAL_UNKILLABLE for traced tasks, users won't expect
> - * debugging to leave init killable.
> + * debugging to leave init killable, unless it is intended to exit.
perhaps ^ HANDLER_EXIT is always fatal.
> */
> - if (action->sa.sa_handler == SIG_DFL && !t->ptrace)
> + if (action->sa.sa_handler == SIG_DFL &&
> + (!t->ptrace || (handler == HANDLER_EXIT)))
> t->signal->flags &= ~SIGNAL_UNKILLABLE;
> ret = send_signal(sig, info, t, PIDTYPE_PID);
> spin_unlock_irqrestore(&t->sighand->siglock, flags);
Eric
Kees Cook <[email protected]> writes:
> Hi,
>
> This fixes the signal refactoring to actually kill unkillable processes
> when receiving a fatal SIGSYS from seccomp. Thanks to Robert for the
> report and Eric for the fix! I've also tweaked seccomp internal a bit to
> fail more safely. This was a partial seccomp bypass, in the sense that
> SECCOMP_RET_KILL_* didn't kill the process, but it didn't bypass other
> aspects of the filters. (i.e. the syscall was still blocked, etc.)
Any luck on figuring out how to suppress the extra event?
>
> I'll be sending this to Linus after a bit more testing...
>
> Thanks,
>
> -Kees
>
> Kees Cook (3):
> signal: HANDLER_EXIT should clear SIGNAL_UNKILLABLE
> seccomp: Invalidate seccomp mode to catch death failures
> samples/seccomp: Adjust sample to also provide kill option
>
> kernel/seccomp.c | 10 ++++++++++
> kernel/signal.c | 5 +++--
> samples/seccomp/dropper.c | 9 +++++++--
> 3 files changed, 20 insertions(+), 4 deletions(-)
Eric
On Thu, Feb 10, 2022 at 04:48:30PM -0600, Eric W. Biederman wrote:
> Kees Cook <[email protected]> writes:
>
> > On Thu, Feb 10, 2022 at 12:58:07PM -0600, Eric W. Biederman wrote:
> >> Kees Cook <[email protected]> writes:
> >>
> >> > On Thu, Feb 10, 2022 at 12:17:50PM -0600, Eric W. Biederman wrote:
> >> >> Kees Cook <[email protected]> writes:
> >> >>
> >> >> > Hi,
> >> >> >
> >> >> > This fixes the signal refactoring to actually kill unkillable processes
> >> >> > when receiving a fatal SIGSYS from seccomp. Thanks to Robert for the
> >> >> > report and Eric for the fix! I've also tweaked seccomp internal a bit to
> >> >> > fail more safely. This was a partial seccomp bypass, in the sense that
> >> >> > SECCOMP_RET_KILL_* didn't kill the process, but it didn't bypass other
> >> >> > aspects of the filters. (i.e. the syscall was still blocked, etc.)
> >> >>
> >> >> Any luck on figuring out how to suppress the extra event?
> >> >
> >> > I haven't found a good single indicator of a process being in an "I am dying"
> >> > state, and even if I did, it seems every architecture's exit path would
> >> > need to add a new test.
> >>
> >> The "I am dying" state for a task is fatal_signal_pending, at least
> >> before get_signal is reached, for a process there is SIGNAL_GROUP_EXIT.
> >> Something I am busily cleaning up and making more reliable at the
> >> moment.
> >
> > The state I need to catch is "I am dying and this syscall was
> > interrupted". fatal_signal_pending() is kind of only the first half
> > (though it doesn't cover fatal SIGSYS?)
> >
> > For example, if a process hits a BUG() in the middle of running a
> > syscall, that syscall isn't expected to "exit" from the perspective of
> > userspace. This is similarly true for seccomp's fatal SIGSYS.
> >
> >> What is the event that is happening? Is it
> >> tracehook_report_syscall_exit or something else?
> >
> > Yes, but in more completely, it's these three, which are called in
> > various fashions from architecture syscall exit code:
> >
> > audit_syscall_exit() (audit)
> > trace_sys_exit() (see "TRACE_EVENT_FN(sys_exit,")
> > tracehook_report_syscall_exit() (ptrace)
> >
> >> From the bits I have seen it seems like something else.
> >
> > But yes, the place Robert and I both noticed it was with ptrace from
> > tracehook_report_syscall_exit(), which is rather poorly named. :)
>
> Speaking of patches I am just about to send out.
>
> > Looking at the results, audit_syscall_exit() and trace_sys_exit() need
> > to be skipped too, since they would each be reporting potential nonsense.
> >
> >> > The best approach seems to be clearing the TIF_*WORK* bits, but that's
> >> > still a bit arch-specific. And I'm not sure which layer would do that.
> >> > At what point have we decided the process will not continue? More
> >> > than seccomp was calling do_exit() in the middle of a syscall, but those
> >> > appear to have all been either SIGKILL or SIGSEGV?
> >>
> >> This is where I get confused what TIF_WORK bits matter?
> >
> > This is where I wish all the architectures were using the common syscall
> > code. The old do_exit() path would completely skip _everything_ in the
> > exit path, so it was like never calling anything after the syscall
> > dispatch table. The only userspace visible things in there are triggered
> > from having TIF_WORK... flags (but again, it's kind of a per-arch mess).
> >
> > Skipping the entire exit path makes a fair bit of sense. For example,
> > rseq_syscall() is redundant (forcing SIGSEGV).
> >
> > Regardless, at least the three places above need to be skipped.
> >
> > But just testing fatal_signal_pending() seems wrong: a normal syscall
> > could be finishing just fine, it just happens to have a fatal signal
> > ready to be processed.
>
> Yes. It is really just the HANDLER_EXIT case where this is interesting.
Right.
>
> >
> > Here's the ordering after a syscall on x86 from do_syscall_64():
> >
> > do_syscall_x64()
> > sys_call_table[...](regs)
> > syscall_exit_to_user_mode()
> > __syscall_exit_to_user_mode_work()
> > syscall_exit_to_user_mode_prepare()
> > syscall_exit_work()
> > arch_syscall_exit_tracehook()
> > tracehook_report_syscall_exit()
> > exit_to_user_mode_prepare()
> > exit_to_user_mode_loop()
> > handle_signal_work()
> > arch_do_signal_or_restart()
> > get_signal()
> > do_group_exit()
> >
> > Here's arm64 from el0_svc():
> >
> > do_el0_svc()
> > el0_svc_common()
> > invoke_syscall()
> > syscall_table[...](regs)
> > syscall_trace_exit()
> > tracehook_report_syscall()
> > tracehook_report_syscall_exit()
> > exit_to_user_mode()
> > prepare_exit_to_user_mode()
> > do_notify_resume()
> > do_signal()
> > get_signal()
> > do_group_exit()
> >
> > In the past, any do_exit() would short circuit everything after the
> > syscall table. Now, we do all the exit work before starting the return
> > to user mode which is what processes the signals. So I guess there's
> > more precisely a difference between "visible to userspace" and "return
> > to userspace".
>
> Yes. I see that now. I had not had an occasion to look at the order
> all of these were called in before and my mental model was wrong.
Yeah, I didn't even have a model of this all the way. I'd really only
understood the ptrace side of it.
> It makes a certain kind of sense that the per syscall work happens
> before we do additional things like process signals. It simply
> had not realized that was happening in that order until now.
>
>
> > (an aside: where to PF_IO_WORKER threads die?)
>
> They are calling do_exit explicitly.
Ah-ha, thanks.
>
> >> I expect if anything else mattered we would need to change it to
> >> HANDLER_EXIT.
> >>
> >> I made a mistake conflating to cases and I want to make certain I
> >> successfully separate those two cases at the end of the day.
> >
> > For skipping the exit work, I'm not sure it matters, since all the
> > signal stuff is "too late"...
>
> The conflation lead me to believe that we could simply and safely cause
> seccomp to use normal signal delivery to kill the process. The first
> part of the conflation I sorted out by introducing HANDLER_EXIT. The
> user visible part of the change I am not yet certain what to do with.
>
> My gut reaction is does it matter? Can you escape the seccomp filter
> with a stop? Does it break userspace?
After fixing UNKILLABLE vs IMMUTABLE, I'm not aware of anything else
misbehaving. The new nonsense exit event, though, is bound to be at
least confusing to humans. ("Why did this syscall not change any of its
registers?", etc.)
> I realize the outcome of that question is that it does matter so we
> probably need to find a way to supress that situation for HANDLER_EXIT.
> Both force_exit_sig and force_sig_seccomp appear to be using dumpable
> signals which makes the problem doubly tricky.
>
> The first tricky bit is fatal_signal_pending isn't set because a
> coredump is possible, so something else is needed to detect this
> condition.
>
> The second part is what to do when we detect the condition.
>
> The only solution I can think of quickly is to modify
> force_sig_info_to_task clear TIF_SYSCALL_WORK on the architectures where
> that is used and to clear SYSCALL_WORK_EXIT on x86 and s390, and to do
> whatever the architecture appropriate thing is on the other
> architectures.
The common accessors for the bits are set_syscall_work()/clear_syscall_work()
but I don't see anything to operate on an entire mask. Maybe it needs to
grow something like reset_syscall_work()?
-Kees
>
> It might be easier once I have cleaned some more code up but as long as
> we can limit the logic to force_sig_info_to_task for the HANDLER_EXIT
> case I don't see any advantage in the cleanups I have planned for this
> case. This may be incentive to make the architecture code more uniform
> in this area.
>
> Anyway I am off to finish fixing some other bugs.
>
> Eric
>
>
--
Kees Cook
On Thu, Feb 10, 2022 at 07:47:00PM -0600, Eric W. Biederman wrote:
> Kees Cook <[email protected]> writes:
> > The common accessors for the bits are set_syscall_work()/clear_syscall_work()
> > but I don't see anything to operate on an entire mask. Maybe it needs to
> > grow something like reset_syscall_work()?
>
> Oh. I hadn't realized SYSCALL_WORK_EXIT and TIF_SYSCALL_WORK were
> masks. Yes it looks like a simple addition of reset_syscall_work()
> and calling it from force_sig_info when HANDLER_EXIT will hide these
> events.
Yeah, it's varied by features, architectures, etc. The good news is it's
WAY nicer now after the USER_DISPATCH. :)
> When you say the events are corrupted did you mean they return wrong data
> to userspace or simply that the events should not fire?
It's mainly about the exit stuff having never been run before on these
kinds of process states, so things don't make sense. For example, on the
SIGSYS death, the registers have been rewound for the coredump, so when
the exit trace runs on x86 it sees the syscall return value as equal to
the syscall number (since %rax is used for the syscall number on entry
and for the syscall result on exit). So when a tracer watches a seccomp
fatal SIGSYS, it sees the syscall exit before it sees the child exit
(and therefore the signal). For example, x86_64 write (syscall number
1), will return as if it had written 1 byte. :P
So, it's not harmful, but it's confusing and weird. :)
> I am trying to figure out if there is a case to be made that it was a
> bug that these events were missing.
I don't think so -- the syscall did not finish, so there isn't a valid
return code. The process exited before it completed.
--
Kees Cook
Kees Cook <[email protected]> writes:
> On Thu, Feb 10, 2022 at 12:58:07PM -0600, Eric W. Biederman wrote:
>> Kees Cook <[email protected]> writes:
>>
>> > On Thu, Feb 10, 2022 at 12:17:50PM -0600, Eric W. Biederman wrote:
>> >> Kees Cook <[email protected]> writes:
>> >>
>> >> > Hi,
>> >> >
>> >> > This fixes the signal refactoring to actually kill unkillable processes
>> >> > when receiving a fatal SIGSYS from seccomp. Thanks to Robert for the
>> >> > report and Eric for the fix! I've also tweaked seccomp internal a bit to
>> >> > fail more safely. This was a partial seccomp bypass, in the sense that
>> >> > SECCOMP_RET_KILL_* didn't kill the process, but it didn't bypass other
>> >> > aspects of the filters. (i.e. the syscall was still blocked, etc.)
>> >>
>> >> Any luck on figuring out how to suppress the extra event?
>> >
>> > I haven't found a good single indicator of a process being in an "I am dying"
>> > state, and even if I did, it seems every architecture's exit path would
>> > need to add a new test.
>>
>> The "I am dying" state for a task is fatal_signal_pending, at least
>> before get_signal is reached, for a process there is SIGNAL_GROUP_EXIT.
>> Something I am busily cleaning up and making more reliable at the
>> moment.
>
> The state I need to catch is "I am dying and this syscall was
> interrupted". fatal_signal_pending() is kind of only the first half
> (though it doesn't cover fatal SIGSYS?)
>
> For example, if a process hits a BUG() in the middle of running a
> syscall, that syscall isn't expected to "exit" from the perspective of
> userspace. This is similarly true for seccomp's fatal SIGSYS.
>
>> What is the event that is happening? Is it
>> tracehook_report_syscall_exit or something else?
>
> Yes, but in more completely, it's these three, which are called in
> various fashions from architecture syscall exit code:
>
> audit_syscall_exit() (audit)
> trace_sys_exit() (see "TRACE_EVENT_FN(sys_exit,")
> tracehook_report_syscall_exit() (ptrace)
>
>> From the bits I have seen it seems like something else.
>
> But yes, the place Robert and I both noticed it was with ptrace from
> tracehook_report_syscall_exit(), which is rather poorly named. :)
Speaking of patches I am just about to send out.
> Looking at the results, audit_syscall_exit() and trace_sys_exit() need
> to be skipped too, since they would each be reporting potential nonsense.
>
>> > The best approach seems to be clearing the TIF_*WORK* bits, but that's
>> > still a bit arch-specific. And I'm not sure which layer would do that.
>> > At what point have we decided the process will not continue? More
>> > than seccomp was calling do_exit() in the middle of a syscall, but those
>> > appear to have all been either SIGKILL or SIGSEGV?
>>
>> This is where I get confused what TIF_WORK bits matter?
>
> This is where I wish all the architectures were using the common syscall
> code. The old do_exit() path would completely skip _everything_ in the
> exit path, so it was like never calling anything after the syscall
> dispatch table. The only userspace visible things in there are triggered
> from having TIF_WORK... flags (but again, it's kind of a per-arch mess).
>
> Skipping the entire exit path makes a fair bit of sense. For example,
> rseq_syscall() is redundant (forcing SIGSEGV).
>
> Regardless, at least the three places above need to be skipped.
>
> But just testing fatal_signal_pending() seems wrong: a normal syscall
> could be finishing just fine, it just happens to have a fatal signal
> ready to be processed.
Yes. It is really just the HANDLER_EXIT case where this is interesting.
>
> Here's the ordering after a syscall on x86 from do_syscall_64():
>
> do_syscall_x64()
> sys_call_table[...](regs)
> syscall_exit_to_user_mode()
> __syscall_exit_to_user_mode_work()
> syscall_exit_to_user_mode_prepare()
> syscall_exit_work()
> arch_syscall_exit_tracehook()
> tracehook_report_syscall_exit()
> exit_to_user_mode_prepare()
> exit_to_user_mode_loop()
> handle_signal_work()
> arch_do_signal_or_restart()
> get_signal()
> do_group_exit()
>
> Here's arm64 from el0_svc():
>
> do_el0_svc()
> el0_svc_common()
> invoke_syscall()
> syscall_table[...](regs)
> syscall_trace_exit()
> tracehook_report_syscall()
> tracehook_report_syscall_exit()
> exit_to_user_mode()
> prepare_exit_to_user_mode()
> do_notify_resume()
> do_signal()
> get_signal()
> do_group_exit()
>
> In the past, any do_exit() would short circuit everything after the
> syscall table. Now, we do all the exit work before starting the return
> to user mode which is what processes the signals. So I guess there's
> more precisely a difference between "visible to userspace" and "return
> to userspace".
Yes. I see that now. I had not had an occasion to look at the order
all of these were called in before and my mental model was wrong.
It makes a certain kind of sense that the per syscall work happens
before we do additional things like process signals. It simply
had not realized that was happening in that order until now.
> (an aside: where to PF_IO_WORKER threads die?)
They are calling do_exit explicitly.
>> I expect if anything else mattered we would need to change it to
>> HANDLER_EXIT.
>>
>> I made a mistake conflating to cases and I want to make certain I
>> successfully separate those two cases at the end of the day.
>
> For skipping the exit work, I'm not sure it matters, since all the
> signal stuff is "too late"...
The conflation lead me to believe that we could simply and safely cause
seccomp to use normal signal delivery to kill the process. The first
part of the conflation I sorted out by introducing HANDLER_EXIT. The
user visible part of the change I am not yet certain what to do with.
My gut reaction is does it matter? Can you escape the seccomp filter
with a stop? Does it break userspace?
I realize the outcome of that question is that it does matter so we
probably need to find a way to supress that situation for HANDLER_EXIT.
Both force_exit_sig and force_sig_seccomp appear to be using dumpable
signals which makes the problem doubly tricky.
The first tricky bit is fatal_signal_pending isn't set because a
coredump is possible, so something else is needed to detect this
condition.
The second part is what to do when we detect the condition.
The only solution I can think of quickly is to modify
force_sig_info_to_task clear TIF_SYSCALL_WORK on the architectures where
that is used and to clear SYSCALL_WORK_EXIT on x86 and s390, and to do
whatever the architecture appropriate thing is on the other
architectures.
It might be easier once I have cleaned some more code up but as long as
we can limit the logic to force_sig_info_to_task for the HANDLER_EXIT
case I don't see any advantage in the cleanups I have planned for this
case. This may be incentive to make the architecture code more uniform
in this area.
Anyway I am off to finish fixing some other bugs.
Eric
Kees Cook <[email protected]> writes:
> On Thu, Feb 10, 2022 at 04:48:30PM -0600, Eric W. Biederman wrote:
>> Kees Cook <[email protected]> writes:
>>
>> > On Thu, Feb 10, 2022 at 12:58:07PM -0600, Eric W. Biederman wrote:
>> >> Kees Cook <[email protected]> writes:
>> >>
>> >> > On Thu, Feb 10, 2022 at 12:17:50PM -0600, Eric W. Biederman wrote:
>> >> >> Kees Cook <[email protected]> writes:
>> >> >>
>> >> >> > Hi,
>> >> >> >
>> >> >> > This fixes the signal refactoring to actually kill unkillable processes
>> >> >> > when receiving a fatal SIGSYS from seccomp. Thanks to Robert for the
>> >> >> > report and Eric for the fix! I've also tweaked seccomp internal a bit to
>> >> >> > fail more safely. This was a partial seccomp bypass, in the sense that
>> >> >> > SECCOMP_RET_KILL_* didn't kill the process, but it didn't bypass other
>> >> >> > aspects of the filters. (i.e. the syscall was still blocked, etc.)
>> >> >>
>> >> >> Any luck on figuring out how to suppress the extra event?
>> >> >
>> >> > I haven't found a good single indicator of a process being in an "I am dying"
>> >> > state, and even if I did, it seems every architecture's exit path would
>> >> > need to add a new test.
>> >>
>> >> The "I am dying" state for a task is fatal_signal_pending, at least
>> >> before get_signal is reached, for a process there is SIGNAL_GROUP_EXIT.
>> >> Something I am busily cleaning up and making more reliable at the
>> >> moment.
>> >
>> > The state I need to catch is "I am dying and this syscall was
>> > interrupted". fatal_signal_pending() is kind of only the first half
>> > (though it doesn't cover fatal SIGSYS?)
>> >
>> > For example, if a process hits a BUG() in the middle of running a
>> > syscall, that syscall isn't expected to "exit" from the perspective of
>> > userspace. This is similarly true for seccomp's fatal SIGSYS.
>> >
>> >> What is the event that is happening? Is it
>> >> tracehook_report_syscall_exit or something else?
>> >
>> > Yes, but in more completely, it's these three, which are called in
>> > various fashions from architecture syscall exit code:
>> >
>> > audit_syscall_exit() (audit)
>> > trace_sys_exit() (see "TRACE_EVENT_FN(sys_exit,")
>> > tracehook_report_syscall_exit() (ptrace)
>> >
>> >> From the bits I have seen it seems like something else.
>> >
>> > But yes, the place Robert and I both noticed it was with ptrace from
>> > tracehook_report_syscall_exit(), which is rather poorly named. :)
>>
>> Speaking of patches I am just about to send out.
>>
>> > Looking at the results, audit_syscall_exit() and trace_sys_exit() need
>> > to be skipped too, since they would each be reporting potential nonsense.
>> >
>> >> > The best approach seems to be clearing the TIF_*WORK* bits, but that's
>> >> > still a bit arch-specific. And I'm not sure which layer would do that.
>> >> > At what point have we decided the process will not continue? More
>> >> > than seccomp was calling do_exit() in the middle of a syscall, but those
>> >> > appear to have all been either SIGKILL or SIGSEGV?
>> >>
>> >> This is where I get confused what TIF_WORK bits matter?
>> >
>> > This is where I wish all the architectures were using the common syscall
>> > code. The old do_exit() path would completely skip _everything_ in the
>> > exit path, so it was like never calling anything after the syscall
>> > dispatch table. The only userspace visible things in there are triggered
>> > from having TIF_WORK... flags (but again, it's kind of a per-arch mess).
>> >
>> > Skipping the entire exit path makes a fair bit of sense. For example,
>> > rseq_syscall() is redundant (forcing SIGSEGV).
>> >
>> > Regardless, at least the three places above need to be skipped.
>> >
>> > But just testing fatal_signal_pending() seems wrong: a normal syscall
>> > could be finishing just fine, it just happens to have a fatal signal
>> > ready to be processed.
>>
>> Yes. It is really just the HANDLER_EXIT case where this is interesting.
>
> Right.
>
>>
>> >
>> > Here's the ordering after a syscall on x86 from do_syscall_64():
>> >
>> > do_syscall_x64()
>> > sys_call_table[...](regs)
>> > syscall_exit_to_user_mode()
>> > __syscall_exit_to_user_mode_work()
>> > syscall_exit_to_user_mode_prepare()
>> > syscall_exit_work()
>> > arch_syscall_exit_tracehook()
>> > tracehook_report_syscall_exit()
>> > exit_to_user_mode_prepare()
>> > exit_to_user_mode_loop()
>> > handle_signal_work()
>> > arch_do_signal_or_restart()
>> > get_signal()
>> > do_group_exit()
>> >
>> > Here's arm64 from el0_svc():
>> >
>> > do_el0_svc()
>> > el0_svc_common()
>> > invoke_syscall()
>> > syscall_table[...](regs)
>> > syscall_trace_exit()
>> > tracehook_report_syscall()
>> > tracehook_report_syscall_exit()
>> > exit_to_user_mode()
>> > prepare_exit_to_user_mode()
>> > do_notify_resume()
>> > do_signal()
>> > get_signal()
>> > do_group_exit()
>> >
>> > In the past, any do_exit() would short circuit everything after the
>> > syscall table. Now, we do all the exit work before starting the return
>> > to user mode which is what processes the signals. So I guess there's
>> > more precisely a difference between "visible to userspace" and "return
>> > to userspace".
>>
>> Yes. I see that now. I had not had an occasion to look at the order
>> all of these were called in before and my mental model was wrong.
>
> Yeah, I didn't even have a model of this all the way. I'd really only
> understood the ptrace side of it.
>
>> It makes a certain kind of sense that the per syscall work happens
>> before we do additional things like process signals. It simply
>> had not realized that was happening in that order until now.
>>
>>
>> > (an aside: where to PF_IO_WORKER threads die?)
>>
>> They are calling do_exit explicitly.
>
> Ah-ha, thanks.
>
>>
>> >> I expect if anything else mattered we would need to change it to
>> >> HANDLER_EXIT.
>> >>
>> >> I made a mistake conflating to cases and I want to make certain I
>> >> successfully separate those two cases at the end of the day.
>> >
>> > For skipping the exit work, I'm not sure it matters, since all the
>> > signal stuff is "too late"...
>>
>> The conflation lead me to believe that we could simply and safely cause
>> seccomp to use normal signal delivery to kill the process. The first
>> part of the conflation I sorted out by introducing HANDLER_EXIT. The
>> user visible part of the change I am not yet certain what to do with.
>>
>> My gut reaction is does it matter? Can you escape the seccomp filter
>> with a stop? Does it break userspace?
>
> After fixing UNKILLABLE vs IMMUTABLE, I'm not aware of anything else
> misbehaving. The new nonsense exit event, though, is bound to be at
> least confusing to humans. ("Why did this syscall not change any of its
> registers?", etc.)
>
>> I realize the outcome of that question is that it does matter so we
>> probably need to find a way to supress that situation for HANDLER_EXIT.
>> Both force_exit_sig and force_sig_seccomp appear to be using dumpable
>> signals which makes the problem doubly tricky.
>>
>> The first tricky bit is fatal_signal_pending isn't set because a
>> coredump is possible, so something else is needed to detect this
>> condition.
>>
>> The second part is what to do when we detect the condition.
>>
>> The only solution I can think of quickly is to modify
>> force_sig_info_to_task clear TIF_SYSCALL_WORK on the architectures where
>> that is used and to clear SYSCALL_WORK_EXIT on x86 and s390, and to do
>> whatever the architecture appropriate thing is on the other
>> architectures.
>
> The common accessors for the bits are set_syscall_work()/clear_syscall_work()
> but I don't see anything to operate on an entire mask. Maybe it needs to
> grow something like reset_syscall_work()?
Oh. I hadn't realized SYSCALL_WORK_EXIT and TIF_SYSCALL_WORK were
masks. Yes it looks like a simple addition of reset_syscall_work()
and calling it from force_sig_info when HANDLER_EXIT will hide these
events.
When you say the events are corrupted did you mean they return wrong data
to userspace or simply that the events should not fire?
I am trying to figure out if there is a case to be made that it was a
bug that these events were missing.
Eric
Jann Horn <[email protected]> writes:
> On Thu, Feb 10, 2022 at 6:37 PM Kees Cook <[email protected]> wrote:
>> On Thu, Feb 10, 2022 at 05:18:39PM +0100, Jann Horn wrote:
>> > On Thu, Feb 10, 2022 at 3:53 AM Kees Cook <[email protected]> wrote:
>> > > Fatal SIGSYS signals were not being delivered to pid namespace init
>> > > processes. Make sure the SIGNAL_UNKILLABLE doesn't get set for these
>> > > cases.
>> > >
>> > > Reported-by: Robert Święcki <[email protected]>
>> > > Suggested-by: "Eric W. Biederman" <[email protected]>
>> > > Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
>> > > Cc: [email protected]
>> > > Signed-off-by: Kees Cook <[email protected]>
>> > > ---
>> > > kernel/signal.c | 5 +++--
>> > > 1 file changed, 3 insertions(+), 2 deletions(-)
>> > >
>> > > diff --git a/kernel/signal.c b/kernel/signal.c
>> > > index 38602738866e..33e3ee4f3383 100644
>> > > --- a/kernel/signal.c
>> > > +++ b/kernel/signal.c
>> > > @@ -1342,9 +1342,10 @@ force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t,
>> > > }
>> > > /*
>> > > * Don't clear SIGNAL_UNKILLABLE for traced tasks, users won't expect
>> > > - * debugging to leave init killable.
>> > > + * debugging to leave init killable, unless it is intended to exit.
>> > > */
>> > > - if (action->sa.sa_handler == SIG_DFL && !t->ptrace)
>> > > + if (action->sa.sa_handler == SIG_DFL &&
>> > > + (!t->ptrace || (handler == HANDLER_EXIT)))
>> > > t->signal->flags &= ~SIGNAL_UNKILLABLE;
>> >
>> > You're changing the subclause:
>> >
>> > !t->ptrace
>> >
>> > to:
>> >
>> > (!t->ptrace || (handler == HANDLER_EXIT))
>> >
>> > which means that the change only affects cases where the process has a
>> > ptracer, right? That's not the scenario the commit message is talking
>> > about...
>>
>> Sorry, yes, I was not as accurate as I should have been in the commit
>> log. I have changed it to:
>>
>> Fatal SIGSYS signals (i.e. seccomp RET_KILL_* syscall filter actions)
>> were not being delivered to ptraced pid namespace init processes. Make
>> sure the SIGNAL_UNKILLABLE doesn't get set for these cases.
>
> So basically force_sig_info() is trying to figure out whether
> get_signal() will later on check for SIGNAL_UNKILLABLE (the SIG_DFL
> case), and if so, it clears the flag from the target's signal_struct
> that marks the process as unkillable?
>
> This used to be:
>
> if (action->sa.sa_handler == SIG_DFL)
> t->signal->flags &= ~SIGNAL_UNKILLABLE;
>
> Then someone noticed that in the ptrace case, the signal might not
> actually end up being consumed by the target process, and added the
> "&& !t->ptrace" clause in commit
> eb61b5911bdc923875cde99eb25203a0e2b06d43.
>
> And now Robert Swiecki noticed that that still didn't accurately model
> what'll happen in get_signal().
>
>
> This seems hacky to me, and also racy: What if, while you're going
> through a SECCOMP_RET_KILL_PROCESS in an unkillable process, some
> other thread e.g. concurrently changes the disposition of SIGSYS from
> a custom handler to SIG_DFL?
>
> Instead of trying to figure out whether the signal would have been
> fatal without SIGNAL_UNKILLABLE, I think it would be better to find a
> way to tell the signal-handling code that SIGNAL_UNKILLABLE should be
> bypassed for this specific signal, or something along those lines...
> but of course that's also kind of messy because the signal-sending
> code might fall back to just using the pending signal mask on
> allocation failure IIRC?
I am actively working on this. I think I know how to get there but
it requires cleanups elsewhere as well, so it is not really an approach
that is appropriate for backporting.
The big bottleneck is that we need to make signals that trigger
coredumps eligible for short circuit delivery, and that takes a little
doing.
Eric
On February 11, 2022 4:54:26 AM PST, "Robert Święcki" <[email protected]> wrote:
>> It's mainly about the exit stuff having never been run before on these
>> kinds of process states, so things don't make sense. For example, on the
>> SIGSYS death, the registers have been rewound for the coredump, so when
>> the exit trace runs on x86 it sees the syscall return value as equal to
>> the syscall number (since %rax is used for the syscall number on entry
>> and for the syscall result on exit). So when a tracer watches a seccomp
>> fatal SIGSYS, it sees the syscall exit before it sees the child exit
>> (and therefore the signal). For example, x86_64 write (syscall number
>> 1), will return as if it had written 1 byte. :P
>>
>> So, it's not harmful, but it's confusing and weird. :)
>>
>> > I am trying to figure out if there is a case to be made that it was a
>> > bug that these events were missing.
>>
>> I don't think so -- the syscall did not finish, so there isn't a valid
>> return code. The process exited before it completed.
>
>A tangential point: please ignore for the purpose of fixing the
>problem at hand. I'm mostly making it, in case it can be taken into
>account in case some bigger changes to this code path are to be made -
>given that it touches the problem of signal delivery.
>
>When I noticed this problem, I was looking for a way to figure out
>what syscall caused SIGSYS (via SECCOMP_RET_KILL_*), and there's no
>easy way to do that programmatically from the perspective of a parent
>process. There are three ways of doing this that come to mind.
I had hoped that the parent could read the SIGSYS siginfo_t from the
child, but I haven't found any way to do this. :( :(
I don't seem to be able to use:
- PTRACE_ATTACH to use PTRACE_PEEKSIGINFO on a dead process.
- signalfd (nothing is in the fd after the exit).
Hmpf.
--
Kees Cook
On Fri, Feb 11, 2022 at 11:46:53AM -0600, Eric W. Biederman wrote:
> Robert Święcki <[email protected]> writes:
> > When I noticed this problem, I was looking for a way to figure out
> > what syscall caused SIGSYS (via SECCOMP_RET_KILL_*), and there's no
> > easy way to do that programmatically from the perspective of a parent
> > process. There are three ways of doing this that come to mind.
>
> Unless I am misunderstanding what you are looking for
> this information is contained within the SIGSYS siginfo.
> The field si_syscall contains the system call number and
> the field si_errno contains return code from the seccomp filter.
>
> All of that can be read from the core dump of the process that exited.
>
> Looking quickly I don't see a good way to pull that signal information
> out of the kernel other than with a coredump.
>
> It might be possible to persuade PTRACE_EVENT_EXIT to give it to you,
> but I haven't looked at it enough to see if that would be a sensible
> strategy.
If there is already a ptrace on the child with PTRACE_O_TRACEEXIT, it
should be possible to use PTRACE_GETSIGINFO.
> > I think it'd be good to have some way of doing it from the perspective
> > of a parent process - it'd simplify development of sandboxing managers
> > (eg nsjail, minijail, firejail), and creation of good seccomp
> > policies.
>
> By development do you mean debugging sandbox managers? Or do you mean
> something that sandbox managers can use on a routine basis?
It'd really be nice to be able to get this info without the ptrace
relationship already in place... hmmm.
--
Kees Cook
pt., 11 lut 2022 o 18:47 Eric W. Biederman <[email protected]> napisał(a):
> > I think it'd be good to have some way of doing it from the perspective
> > of a parent process - it'd simplify development of sandboxing managers
> > (eg nsjail, minijail, firejail), and creation of good seccomp
> > policies.
>
> By development do you mean debugging sandbox managers? Or do you mean
> something that sandbox managers can use on a routine basis?
On a routine basis for 2 purposes
a). development of seccomp policies - the manager (regular parent of
sandboxed process) can tell which syscall (and arguments) failed and
it can be then added to policy (though, 'strace -f -c' might be a
better option here)
b). in case of actual seccomp SIGSYS kill, it could then inform users
about what exactly and where happened (syscall no, cpu arch,
arguments, maybe also eip + stack ptr)
But, I don't want to derail the current bug fixing effort, so I just
wanted to mention all of this quickly, and maybe we can follow up on
this in the future.
On Thu, Feb 10, 2022 at 12:58:07PM -0600, Eric W. Biederman wrote:
> Kees Cook <[email protected]> writes:
>
> > On Thu, Feb 10, 2022 at 12:17:50PM -0600, Eric W. Biederman wrote:
> >> Kees Cook <[email protected]> writes:
> >>
> >> > Hi,
> >> >
> >> > This fixes the signal refactoring to actually kill unkillable processes
> >> > when receiving a fatal SIGSYS from seccomp. Thanks to Robert for the
> >> > report and Eric for the fix! I've also tweaked seccomp internal a bit to
> >> > fail more safely. This was a partial seccomp bypass, in the sense that
> >> > SECCOMP_RET_KILL_* didn't kill the process, but it didn't bypass other
> >> > aspects of the filters. (i.e. the syscall was still blocked, etc.)
> >>
> >> Any luck on figuring out how to suppress the extra event?
> >
> > I haven't found a good single indicator of a process being in an "I am dying"
> > state, and even if I did, it seems every architecture's exit path would
> > need to add a new test.
>
> The "I am dying" state for a task is fatal_signal_pending, at least
> before get_signal is reached, for a process there is SIGNAL_GROUP_EXIT.
> Something I am busily cleaning up and making more reliable at the
> moment.
The state I need to catch is "I am dying and this syscall was
interrupted". fatal_signal_pending() is kind of only the first half
(though it doesn't cover fatal SIGSYS?)
For example, if a process hits a BUG() in the middle of running a
syscall, that syscall isn't expected to "exit" from the perspective of
userspace. This is similarly true for seccomp's fatal SIGSYS.
> What is the event that is happening? Is it
> tracehook_report_syscall_exit or something else?
Yes, but in more completely, it's these three, which are called in
various fashions from architecture syscall exit code:
audit_syscall_exit() (audit)
trace_sys_exit() (see "TRACE_EVENT_FN(sys_exit,")
tracehook_report_syscall_exit() (ptrace)
> From the bits I have seen it seems like something else.
But yes, the place Robert and I both noticed it was with ptrace from
tracehook_report_syscall_exit(), which is rather poorly named. :)
Looking at the results, audit_syscall_exit() and trace_sys_exit() need
to be skipped too, since they would each be reporting potential nonsense.
> > The best approach seems to be clearing the TIF_*WORK* bits, but that's
> > still a bit arch-specific. And I'm not sure which layer would do that.
> > At what point have we decided the process will not continue? More
> > than seccomp was calling do_exit() in the middle of a syscall, but those
> > appear to have all been either SIGKILL or SIGSEGV?
>
> This is where I get confused what TIF_WORK bits matter?
This is where I wish all the architectures were using the common syscall
code. The old do_exit() path would completely skip _everything_ in the
exit path, so it was like never calling anything after the syscall
dispatch table. The only userspace visible things in there are triggered
from having TIF_WORK... flags (but again, it's kind of a per-arch mess).
Skipping the entire exit path makes a fair bit of sense. For example,
rseq_syscall() is redundant (forcing SIGSEGV).
Regardless, at least the three places above need to be skipped.
But just testing fatal_signal_pending() seems wrong: a normal syscall
could be finishing just fine, it just happens to have a fatal signal
ready to be processed.
Here's the ordering after a syscall on x86 from do_syscall_64():
do_syscall_x64()
sys_call_table[...](regs)
syscall_exit_to_user_mode()
__syscall_exit_to_user_mode_work()
syscall_exit_to_user_mode_prepare()
syscall_exit_work()
arch_syscall_exit_tracehook()
tracehook_report_syscall_exit()
exit_to_user_mode_prepare()
exit_to_user_mode_loop()
handle_signal_work()
arch_do_signal_or_restart()
get_signal()
do_group_exit()
Here's arm64 from el0_svc():
do_el0_svc()
el0_svc_common()
invoke_syscall()
syscall_table[...](regs)
syscall_trace_exit()
tracehook_report_syscall()
tracehook_report_syscall_exit()
exit_to_user_mode()
prepare_exit_to_user_mode()
do_notify_resume()
do_signal()
get_signal()
do_group_exit()
In the past, any do_exit() would short circuit everything after the
syscall table. Now, we do all the exit work before starting the return
to user mode which is what processes the signals. So I guess there's
more precisely a difference between "visible to userspace" and "return
to userspace".
(an aside: where to PF_IO_WORKER threads die?)
> I expect if anything else mattered we would need to change it to
> HANDLER_EXIT.
>
> I made a mistake conflating to cases and I want to make certain I
> successfully separate those two cases at the end of the day.
For skipping the exit work, I'm not sure it matters, since all the
signal stuff is "too late"...
--
Kees Cook
On Thu, Feb 10, 2022 at 10:09 PM Kees Cook <[email protected]> wrote:
> On Thu, Feb 10, 2022 at 07:01:39PM +0100, Jann Horn wrote:
> > On Thu, Feb 10, 2022 at 6:37 PM Kees Cook <[email protected]> wrote:
> > > On Thu, Feb 10, 2022 at 05:18:39PM +0100, Jann Horn wrote:
> > > > On Thu, Feb 10, 2022 at 3:53 AM Kees Cook <[email protected]> wrote:
> > > > > Fatal SIGSYS signals were not being delivered to pid namespace init
> > > > > processes. Make sure the SIGNAL_UNKILLABLE doesn't get set for these
> > > > > cases.
> > > > >
> > > > > Reported-by: Robert Święcki <[email protected]>
> > > > > Suggested-by: "Eric W. Biederman" <[email protected]>
> > > > > Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
> > > > > Cc: [email protected]
> > > > > Signed-off-by: Kees Cook <[email protected]>
> > > > > ---
> > > > > kernel/signal.c | 5 +++--
> > > > > 1 file changed, 3 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/kernel/signal.c b/kernel/signal.c
> > > > > index 38602738866e..33e3ee4f3383 100644
> > > > > --- a/kernel/signal.c
> > > > > +++ b/kernel/signal.c
> > > > > @@ -1342,9 +1342,10 @@ force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t,
> > > > > }
> > > > > /*
> > > > > * Don't clear SIGNAL_UNKILLABLE for traced tasks, users won't expect
> > > > > - * debugging to leave init killable.
> > > > > + * debugging to leave init killable, unless it is intended to exit.
> > > > > */
> > > > > - if (action->sa.sa_handler == SIG_DFL && !t->ptrace)
> > > > > + if (action->sa.sa_handler == SIG_DFL &&
> > > > > + (!t->ptrace || (handler == HANDLER_EXIT)))
> > > > > t->signal->flags &= ~SIGNAL_UNKILLABLE;
> > > >
> > > > You're changing the subclause:
> > > >
> > > > !t->ptrace
> > > >
> > > > to:
> > > >
> > > > (!t->ptrace || (handler == HANDLER_EXIT))
> > > >
> > > > which means that the change only affects cases where the process has a
> > > > ptracer, right? That's not the scenario the commit message is talking
> > > > about...
> > >
> > > Sorry, yes, I was not as accurate as I should have been in the commit
> > > log. I have changed it to:
> > >
> > > Fatal SIGSYS signals (i.e. seccomp RET_KILL_* syscall filter actions)
> > > were not being delivered to ptraced pid namespace init processes. Make
> > > sure the SIGNAL_UNKILLABLE doesn't get set for these cases.
> >
> > So basically force_sig_info() is trying to figure out whether
> > get_signal() will later on check for SIGNAL_UNKILLABLE (the SIG_DFL
> > case), and if so, it clears the flag from the target's signal_struct
> > that marks the process as unkillable?
> >
> > This used to be:
> >
> > if (action->sa.sa_handler == SIG_DFL)
> > t->signal->flags &= ~SIGNAL_UNKILLABLE;
> >
> > Then someone noticed that in the ptrace case, the signal might not
> > actually end up being consumed by the target process, and added the
> > "&& !t->ptrace" clause in commit
> > eb61b5911bdc923875cde99eb25203a0e2b06d43.
> >
> > And now Robert Swiecki noticed that that still didn't accurately model
> > what'll happen in get_signal().
> >
> > This seems hacky to me, and also racy: What if, while you're going
> > through a SECCOMP_RET_KILL_PROCESS in an unkillable process, some
> > other thread e.g. concurrently changes the disposition of SIGSYS from
> > a custom handler to SIG_DFL?
>
> Do you mean after force_sig_info_to_task() has finished but before
> get_signal()? SA_IMMUTABLE will block changes to the action.
Yeah, that's what I meant.
Thanks, I missed SA_IMMUTABLE. Ugh, this is not pretty code...
Robert Święcki <[email protected]> writes:
>> It's mainly about the exit stuff having never been run before on these
>> kinds of process states, so things don't make sense. For example, on the
>> SIGSYS death, the registers have been rewound for the coredump, so when
>> the exit trace runs on x86 it sees the syscall return value as equal to
>> the syscall number (since %rax is used for the syscall number on entry
>> and for the syscall result on exit). So when a tracer watches a seccomp
>> fatal SIGSYS, it sees the syscall exit before it sees the child exit
>> (and therefore the signal). For example, x86_64 write (syscall number
>> 1), will return as if it had written 1 byte. :P
>>
>> So, it's not harmful, but it's confusing and weird. :)
>>
>> > I am trying to figure out if there is a case to be made that it was a
>> > bug that these events were missing.
>>
>> I don't think so -- the syscall did not finish, so there isn't a valid
>> return code. The process exited before it completed.
With the process state rewound for the coredump it makes sense
why the syscall exit would be meaningless. So at least for
now I am convinced function that clears all syscall_work flags
is the way to go.
> A tangential point: please ignore for the purpose of fixing the
> problem at hand. I'm mostly making it, in case it can be taken into
> account in case some bigger changes to this code path are to be made -
> given that it touches the problem of signal delivery.
>
> When I noticed this problem, I was looking for a way to figure out
> what syscall caused SIGSYS (via SECCOMP_RET_KILL_*), and there's no
> easy way to do that programmatically from the perspective of a parent
> process. There are three ways of doing this that come to mind.
Unless I am misunderstanding what you are looking for
this information is contained within the SIGSYS siginfo.
The field si_syscall contains the system call number and
the field si_errno contains return code from the seccomp filter.
All of that can be read from the core dump of the process that exited.
Looking quickly I don't see a good way to pull that signal information
out of the kernel other than with a coredump.
It might be possible to persuade PTRACE_EVENT_EXIT to give it to you,
but I haven't looked at it enough to see if that would be a sensible
strategy.
> 1). Keep reference to /proc/<child>/syscall and read it upon process
> exiting by SIGSYS (and reading it with wait/id(WNOWAIT) from parent).
> This used to work a long time ago, but was racy (I reported this
> problem many years ago), and currently only -1 0 0 is returned (as in,
> no syscall in progress).
That might be a bug worth fixing. But it would definitely need a test
that is run regularly to prevent future regressions.
> 2). Use ptrace - it works but it changes the logic of the signal
> delivery inside a traced process and requires non-trivial code to make
> it work correctly: use of PT_INTERRUPT, understanding all signal
> delivery events, registers and their mapping to syscall arguments per
> CPU arch.
I guess this works because you can see which syscall occurred before the
SECCOMP_RET_KILL. Except for the bugs we are discussing fixing there
isn't a ptrace_stop after SECCOMP_RET_KILL.
> 3). auditd will print details of failed syscall to kmsg, but the
> string is not very structured, and auditd might not be always present
> inside kernels. And reading that data via netlink requires root IIRC.
I assume this is the same you can see which syscall occurred before
the SECCOMP_RET_KILL.
>
> I think it'd be good to have some way of doing it from the perspective
> of a parent process - it'd simplify development of sandboxing managers
> (eg nsjail, minijail, firejail), and creation of good seccomp
> policies.
By development do you mean debugging sandbox managers? Or do you mean
something that sandbox managers can use on a routine basis?
Eric
> It's mainly about the exit stuff having never been run before on these
> kinds of process states, so things don't make sense. For example, on the
> SIGSYS death, the registers have been rewound for the coredump, so when
> the exit trace runs on x86 it sees the syscall return value as equal to
> the syscall number (since %rax is used for the syscall number on entry
> and for the syscall result on exit). So when a tracer watches a seccomp
> fatal SIGSYS, it sees the syscall exit before it sees the child exit
> (and therefore the signal). For example, x86_64 write (syscall number
> 1), will return as if it had written 1 byte. :P
>
> So, it's not harmful, but it's confusing and weird. :)
>
> > I am trying to figure out if there is a case to be made that it was a
> > bug that these events were missing.
>
> I don't think so -- the syscall did not finish, so there isn't a valid
> return code. The process exited before it completed.
A tangential point: please ignore for the purpose of fixing the
problem at hand. I'm mostly making it, in case it can be taken into
account in case some bigger changes to this code path are to be made -
given that it touches the problem of signal delivery.
When I noticed this problem, I was looking for a way to figure out
what syscall caused SIGSYS (via SECCOMP_RET_KILL_*), and there's no
easy way to do that programmatically from the perspective of a parent
process. There are three ways of doing this that come to mind.
1). Keep reference to /proc/<child>/syscall and read it upon process
exiting by SIGSYS (and reading it with wait/id(WNOWAIT) from parent).
This used to work a long time ago, but was racy (I reported this
problem many years ago), and currently only -1 0 0 is returned (as in,
no syscall in progress).
2). Use ptrace - it works but it changes the logic of the signal
delivery inside a traced process and requires non-trivial code to make
it work correctly: use of PT_INTERRUPT, understanding all signal
delivery events, registers and their mapping to syscall arguments per
CPU arch.
3). auditd will print details of failed syscall to kmsg, but the
string is not very structured, and auditd might not be always present
inside kernels. And reading that data via netlink requires root IIRC.
I think it'd be good to have some way of doing it from the perspective
of a parent process - it'd simplify development of sandboxing managers
(eg nsjail, minijail, firejail), and creation of good seccomp
policies.
On Thu, Feb 10, 2022 at 3:53 AM Kees Cook <[email protected]> wrote:
> Fatal SIGSYS signals were not being delivered to pid namespace init
> processes. Make sure the SIGNAL_UNKILLABLE doesn't get set for these
> cases.
>
> Reported-by: Robert Święcki <[email protected]>
> Suggested-by: "Eric W. Biederman" <[email protected]>
> Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
> Cc: [email protected]
> Signed-off-by: Kees Cook <[email protected]>
> ---
> kernel/signal.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 38602738866e..33e3ee4f3383 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1342,9 +1342,10 @@ force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t,
> }
> /*
> * Don't clear SIGNAL_UNKILLABLE for traced tasks, users won't expect
> - * debugging to leave init killable.
> + * debugging to leave init killable, unless it is intended to exit.
> */
> - if (action->sa.sa_handler == SIG_DFL && !t->ptrace)
> + if (action->sa.sa_handler == SIG_DFL &&
> + (!t->ptrace || (handler == HANDLER_EXIT)))
> t->signal->flags &= ~SIGNAL_UNKILLABLE;
You're changing the subclause:
!t->ptrace
to:
(!t->ptrace || (handler == HANDLER_EXIT))
which means that the change only affects cases where the process has a
ptracer, right? That's not the scenario the commit message is talking
about...
> ret = send_signal(sig, info, t, PIDTYPE_PID);
> spin_unlock_irqrestore(&t->sighand->siglock, flags);
> --
> 2.30.2
>