LinuxLists.cc - [REGRESSION] ptrace broken from "cgroup: cgroup v2 freezer" (76f969e)

2019-05-13 01:22:04

Subject: [REGRESSION] ptrace broken from "cgroup: cgroup v2 freezer" (76f969e)

Hi,

I was trying to use strace recently and found that it exhibited some
strange behavior. I produced this minimal test case:

#include <unistd.h>

int main() {
write(1, "a", 1);
return 0;
}

which, when run using "gcc test.c && strace ./a.out" produces this
strace output:

[ pre-main omitted ]
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[ repeats forever ]

The correct result is of course:

[ pre-main omitted ]
write(1, "a", 1) = 1
exit_group(0) = ?
+++ exited with 0 +++

Strangely, this only occurs when outputting to a tty-like output.
Running "strace ./a.out" from a native Linux x86 console or a terminal
emulator causes the abnormal behavior. However, the following commands
work correctly:

- strace ./a.out >/dev/null
- strace ./a.out >/tmp/a # /tmp is a standard tmpfs
- strace ./a.out >&- # causes -1 EBADF (Bad file descriptor)

"strace -o /tmp/a ./a.out" hangs and produces the above (infinite)
output to /tmp/a.

I bisected this to 76f969e, "cgroup: cgroup v2 freezer". I reverted the
entire patchset (reverting only that one caused a conflict), which
resolved the issue. I skimmed the patch and came up with this
workaround, which also resolves the issue. I am not at all clear on the
technical workings of the patchset, but it seems to me like a process's
frozen status is supposed to be "suspended" when a frozen process is
ptraced, and "unsuspended" when ptracing ends. Therefore, it seems
suspicious to always "enter frozen" whether or not the cgroup is
actually frozen. It seems like the code should instead check if the
cgroup is actually frozen, and if so, restore the frozen status.

I am using systemd but not any other cgroup features. I tried in an
initramfs environment (no systemd, /init -> shell script) and reproduced
the failing test case.

Please CC me on replies.

Thanks,
Alex.

---
kernel/signal.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 62f9aea4a15a..47145d9d89ca 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2110,7 +2110,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
preempt_disable();
read_unlock(&tasklist_lock);
preempt_enable_no_resched();
- cgroup_enter_frozen();
+ //cgroup_enter_frozen();
freezable_schedule();
} else {
/*
@@ -2289,7 +2289,7 @@ static bool do_signal_stop(int signr)
}

/* Now we don't run again until woken by SIGCONT or SIGKILL */
- cgroup_enter_frozen();
+ //cgroup_enter_frozen();
freezable_schedule();
return true;
} else {
--
2.21.0

2019-05-13 02:01:24

by Valdis Klētnieks

[permalink] [raw]

Subject: Re: [REGRESSION] ptrace broken from "cgroup: cgroup v2 freezer" (76f969e)

On Sun, 12 May 2019 21:20:12 -0400, "Alex Xu (Hello71)" said:

> I bisected this to 76f969e, "cgroup: cgroup v2 freezer". I reverted the
> entire patchset (reverting only that one caused a conflict), which
> resolved the issue. I skimmed the patch and came up with this
> workaround, which also resolves the issue. I am not at all clear on the
> technical workings of the patchset, but it seems to me like a process's
> frozen status is supposed to be "suspended" when a frozen process is
> ptraced, and "unsuspended" when ptracing ends. Therefore, it seems
> suspicious to always "enter frozen" whether or not the cgroup is
> actually frozen. It seems like the code should instead check if the
> cgroup is actually frozen, and if so, restore the frozen status.

Your analysis seems to be more or less correct, especially since your
one-line workaround patch makes things start working again.

> - cgroup_enter_frozen();
> + //cgroup_enter_frozen();

However, I'm sure this isn't a proper fix - it needs an if statement guarding
it (or a check inside the function). If the kernel is ever in a state where it
*should* be frozen, and it doesn't freeze, things will misbehave.

One has to wonder how many things would be different if the ptrace()
system call wasn't such a steaming pile of dingo's kidneys....

Attachments:

(No filename) (849.00 B)

2019-05-13 13:40:14

by Oleg Nesterov

[permalink] [raw]

Subject: Re: [REGRESSION] ptrace broken from "cgroup: cgroup v2 freezer" (76f969e)

On 05/12, Alex Xu (Hello71) wrote:
>
> Hi,
>
> I was trying to use strace recently and found that it exhibited some
> strange behavior. I produced this minimal test case:
>
> #include <unistd.h>
>
> int main() {
> write(1, "a", 1);
> return 0;
> }
>
> which, when run using "gcc test.c && strace ./a.out" produces this
> strace output:
>
> [ pre-main omitted ]
> write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> [ repeats forever ]

Yes, cgroup_enter_frozen() alone is wrong, we have already discussed this
a bit... see https://lore.kernel.org/lkml/[email protected]/

Probably we add leave_frozen(true) after freezable_schedule() for now, then
think try to make something better...

But I am not sure I 100% understand whats going on in this case, could you
try the patch below? (Just in case, of course it is wrong).

Oleg.

--- x/kernel/signal.c
+++ x/kernel/signal.c
@@ -149,8 +149,7 @@
{
if ((t->jobctl & (JOBCTL_PENDING_MASK | JOBCTL_TRAP_FREEZE)) ||
PENDING(&t->pending, &t->blocked) ||
- PENDING(&t->signal->shared_pending, &t->blocked) ||
- cgroup_task_frozen(t)) {
+ PENDING(&t->signal->shared_pending, &t->blocked) {
set_tsk_thread_flag(t, TIF_SIGPENDING);
return true;
}

2019-05-13 19:27:50

by Oleg Nesterov

[permalink] [raw]

Subject: Re: [REGRESSION] ptrace broken from "cgroup: cgroup v2 freezer" (76f969e)

On 05/13, Oleg Nesterov wrote:
>
> Probably we add leave_frozen(true) after freezable_schedule() for now, then
> think try to make something better...

And again, this is what I thought ptrace_stop() does, somehow I didn't notice
that the last version doesn't have leave_frozen() in ptrace_stop().

Perhaps we can do a bit better, change only tracehook_report_syscall_entry() and
PTRACE_EVENT_EXIT/SECCOMP paths to do leave_frozen() ?

At first glance other callers look fine in that they can do nothing "interesting"
befor get_signal(), but we need to re-check...

Oleg.

> But I am not sure I 100% understand whats going on in this case, could you
> try the patch below? (Just in case, of course it is wrong).
>
> Oleg.
>
> --- x/kernel/signal.c
> +++ x/kernel/signal.c
> @@ -149,8 +149,7 @@
> {
> if ((t->jobctl & (JOBCTL_PENDING_MASK | JOBCTL_TRAP_FREEZE)) ||
> PENDING(&t->pending, &t->blocked) ||
> - PENDING(&t->signal->shared_pending, &t->blocked) ||
> - cgroup_task_frozen(t)) {
> + PENDING(&t->signal->shared_pending, &t->blocked) {
> set_tsk_thread_flag(t, TIF_SIGPENDING);
> return true;
> }
>

2019-05-13 19:31:16

by Roman Gushchin

[permalink] [raw]

Subject: Re: [REGRESSION] ptrace broken from "cgroup: cgroup v2 freezer" (76f969e)

Hi Alex!

Thank you for the report!
It's super clear, and contains all the details, so it took me 30s
to reproduce the issue. Really appreciate your effort!

On Sun, May 12, 2019 at 09:20:12PM -0400, Alex Xu (Hello71) wrote:
> Hi,
>
> I was trying to use strace recently and found that it exhibited some
> strange behavior. I produced this minimal test case:
>
> #include <unistd.h>
>
> int main() {
> write(1, "a", 1);
> return 0;
> }
>
> which, when run using "gcc test.c && strace ./a.out" produces this
> strace output:
>
> [ pre-main omitted ]
> write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> [ repeats forever ]
>
> The correct result is of course:
>
> [ pre-main omitted ]
> write(1, "a", 1) = 1
> exit_group(0) = ?
> +++ exited with 0 +++
>
> Strangely, this only occurs when outputting to a tty-like output.
> Running "strace ./a.out" from a native Linux x86 console or a terminal
> emulator causes the abnormal behavior. However, the following commands
> work correctly:
>
> - strace ./a.out >/dev/null
> - strace ./a.out >/tmp/a # /tmp is a standard tmpfs
> - strace ./a.out >&- # causes -1 EBADF (Bad file descriptor)
>
> "strace -o /tmp/a ./a.out" hangs and produces the above (infinite)
> output to /tmp/a.
>
> I bisected this to 76f969e, "cgroup: cgroup v2 freezer". I reverted the
> entire patchset (reverting only that one caused a conflict), which
> resolved the issue. I skimmed the patch and came up with this
> workaround, which also resolves the issue. I am not at all clear on the
> technical workings of the patchset, but it seems to me like a process's
> frozen status is supposed to be "suspended" when a frozen process is
> ptraced, and "unsuspended" when ptracing ends. Therefore, it seems
> suspicious to always "enter frozen" whether or not the cgroup is
> actually frozen. It seems like the code should instead check if the
> cgroup is actually frozen, and if so, restore the frozen status.

So, the thing is that when the freezer tries to freeze all tasks
in the cgroup, some tasks may sleep (e.g. being SIGSTOPPed),
and the freezer can't get them out of this state and put them back correctly
after unfreezing. So instead it leaves such tasks in the original state
and treats them as frozen. This is why we need this unconditional
cgroup_enter_frozen(). It's not the problem.

Anyway, I'm sure that with great help from Oleg we'll be able
to fix the issue very soon (I already posted a preliminary patch).

Once again, thank you for the report!

Roman

2019-05-13 19:31:51

by Roman Gushchin

[permalink] [raw]

Subject: Re: [REGRESSION] ptrace broken from "cgroup: cgroup v2 freezer" (76f969e)

On Mon, May 13, 2019 at 06:38:14PM +0200, Oleg Nesterov wrote:
> On 05/13, Oleg Nesterov wrote:
> >
> > Probably we add leave_frozen(true) after freezable_schedule() for now, then
> > think try to make something better...
>
> And again, this is what I thought ptrace_stop() does, somehow I didn't notice
> that the last version doesn't have leave_frozen() in ptrace_stop().
>
> Perhaps we can do a bit better, change only tracehook_report_syscall_entry() and
> PTRACE_EVENT_EXIT/SECCOMP paths to do leave_frozen() ?
>
> At first glance other callers look fine in that they can do nothing "interesting"
> befor get_signal(), but we need to re-check...

Hi Oleg!

Thank you for looking into it!

I've just check the following patch (see below). It solves the regression
and overall seems correct. But I need some more time to check.

Thanks!

--

diff --git a/kernel/signal.c b/kernel/signal.c
index 8607b11ff936..088b377ad439 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2112,6 +2112,8 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
preempt_enable_no_resched();
cgroup_enter_frozen();
freezable_schedule();
+ if (info)
+ cgroup_leave_frozen(true);
} else {
/*
* By the time we got the lock, our tracer went away.