2020-04-09 17:09:13

by Linus Torvalds

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Thu, Apr 9, 2020 at 9:15 AM Linus Torvalds
<[email protected]> wrote:
>
> may_ptrace_stop() is supposed to stop the blocking exactly so that it
> doesn't deadlock.
>
> I wonder why that doesn't work..
>
> [ Goes and look ]
>
> Oh. I see.
>
> That may_ptrace_stop() only ever considered core-dumping, not execve().
>
> But if _that_ is the reason for the deadlock, then it's trivially fixed.

So maybe may_ptrace_stop() should just do something like this
(ENTIRELY UNTESTED):

	struct task_struct *me = current, *parent = me->parent;

	if (!likely(me->ptrace))
		return false;

	/* If the parent is exiting or core-dumping, it's not
	   listening to our signals */
	if (parent->signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP))
		return false;

	/* if the parent is going through an execve(), it's not listening */
	if (parent->signal->group_exit_task)
		return false;

	return true;

instead of the fairly ad-hoc tests for core-dumping.

The above is hand-wavy - I didn't think a lot about locking.
may_ptrace_stop() is already called under the tasklist_lock, so the
parent won't change, but maybe it should take the signal lock?

So the above very much is *not* meant to be a "do it like this", more
of a "this direction, maybe"?

The existing code is definitely broken. It special-cases core-dumping
probably simply because that's the only case people had realized, and
not thought of the execve() thing.

Linus


2020-04-09 17:17:00

by Eric W. Biederman

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1


Adding Oleg to the conversation, if for no other reason than so that he
can see it is happening.

Oleg has had a test case where this can happen for years and nothing
has come out as an obvious proper fix for this deadlock issue.

Linus Torvalds <[email protected]> writes:

> On Thu, Apr 9, 2020 at 9:15 AM Linus Torvalds
> <[email protected]> wrote:
>>
>> may_ptrace_stop() is supposed to stop the blocking exactly so that it
>> doesn't deadlock.
>>
>> I wonder why that doesn't work..
>>
>> [ Goes and look ]
>>
>> Oh. I see.
>>
>> That may_ptrace_stop() only ever considered core-dumping, not execve().
>>
>> But if _that_ is the reason for the deadlock, then it's trivially fixed.
>
> So maybe may_ptrace_stop() should just do something like this
> (ENTIRELY UNTESTED):
>
> 	struct task_struct *me = current, *parent = me->parent;
>
> 	if (!likely(me->ptrace))
> 		return false;
>
> 	/* If the parent is exiting or core-dumping, it's not
> 	   listening to our signals */
> 	if (parent->signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP))
> 		return false;
>
> 	/* if the parent is going through an execve(), it's not listening */
> 	if (parent->signal->group_exit_task)
> 		return false;
>
> 	return true;
>
> instead of the fairly ad-hoc tests for core-dumping.
>
> The above is hand-wavy - I didn't think a lot about locking.
> may_ptrace_stop() is already called under the tasklist_lock, so the
> parent won't change, but maybe it should take the signal lock?
>
> So the above very much is *not* meant to be a "do it like this", more
> of a "this direction, maybe"?
>
> The existing code is definitely broken. It special-cases core-dumping
> probably simply because that's the only case people had realized, and
> not thought of the execve() thing.


I don't see how there can be a complete solution in may_ptrace_stop.

a) We must stop in PTRACE_EVENT_EXIT during exec or userspace *breaks*.

Those are the defined semantics and I believe it is something
as common as strace that depends on them.

b) Even if we added a test for our ptrace parent blocking in a ptrace
attach of an ongoing exec, it still wouldn't help.

That ptrace attach could legitimately come after the thread in
question has stopped and notified its parent that it is stopped.



None of this is to say I like the semantics of PTRACE_EVENT_EXIT. It is
just we will violate the no regressions rule if we don't stop there
during exec.

The normal case is that strace or whoever is already attached to
all of the threads during exec and no deadlock occurs. So the current
behavior is quite usable.



Maybe my memory is wrong that userspace cares but I really don't think
so.


Eric

2020-04-09 17:23:27

by Bernd Edlinger

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1


On 4/9/20 7:03 PM, Eric W. Biederman wrote:
>
> Adding Oleg to the conversation, if for no other reason than so that he
> can see it is happening.
>
> Oleg has had a test case where this can happen for years and nothing
> has come out as an obvious proper fix for this deadlock issue.
>

Just for reference, I used Oleg's test case,
and improved it a bit. The test case anticipates the
EAGAIN return code from PTRACE_ATTACH. This is likely
to change somehow.
If Linus's idea works, you will probably have to
look at adjusting the test expectations again.

I would still be surprised if any other solution works.


Bernd.

> Linus Torvalds <[email protected]> writes:
>
>> On Thu, Apr 9, 2020 at 9:15 AM Linus Torvalds
>> <[email protected]> wrote:
>>>
>>> may_ptrace_stop() is supposed to stop the blocking exactly so that it
>>> doesn't deadlock.
>>>
>>> I wonder why that doesn't work..
>>>
>>> [ Goes and look ]
>>>
>>> Oh. I see.
>>>
>>> That may_ptrace_stop() only ever considered core-dumping, not execve().
>>>
>>> But if _that_ is the reason for the deadlock, then it's trivially fixed.
>>
>> So maybe may_ptrace_stop() should just do something like this
>> (ENTIRELY UNTESTED):
>>
>> 	struct task_struct *me = current, *parent = me->parent;
>>
>> 	if (!likely(me->ptrace))
>> 		return false;
>>
>> 	/* If the parent is exiting or core-dumping, it's not
>> 	   listening to our signals */
>> 	if (parent->signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP))
>> 		return false;
>>
>> 	/* if the parent is going through an execve(), it's not listening */
>> 	if (parent->signal->group_exit_task)
>> 		return false;
>>
>> 	return true;
>>
>> instead of the fairly ad-hoc tests for core-dumping.
>>
>> The above is hand-wavy - I didn't think a lot about locking.
>> may_ptrace_stop() is already called under the tasklist_lock, so the
>> parent won't change, but maybe it should take the signal lock?
>>
>> So the above very much is *not* meant to be a "do it like this", more
>> of a "this direction, maybe"?
>>
>> The existing code is definitely broken. It special-cases core-dumping
>> probably simply because that's the only case people had realized, and
>> not thought of the execve() thing.
>
>
> I don't see how there can be a complete solution in may_ptrace_stop.
>
> a) We must stop in PTRACE_EVENT_EXIT during exec or userspace *breaks*.
>
> Those are the defined semantics and I believe it is something
> as common as strace that depends on them.
>
> b) Even if we added a test for our ptrace parent blocking in a ptrace
> attach of an ongoing exec, it still wouldn't help.
>
> That ptrace attach could legitimately come after the thread in
> question has stopped and notified its parent that it is stopped.
>
>
>
> None of this is to say I like the semantics of PTRACE_EVENT_EXIT. It is
> just we will violate the no regressions rule if we don't stop there
> during exec.
>
> The normal case is that strace or whoever is already attached to
> all of the threads during exec and no deadlock occurs. So the current
> behavior is quite usable.
>
>
>
> Maybe my memory is wrong that userspace cares but I really don't think
> so.
>
>
> Eric
>

2020-04-09 17:37:53

by Linus Torvalds

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Thu, Apr 9, 2020 at 10:06 AM Eric W. Biederman <[email protected]> wrote:
>
> a) We must stop in PTRACE_EVENT_EXIT during exec or userspace *breaks*.
>
> Those are the defined semantics and I believe it is something
> as common as strace that depends on them.

Don't be silly.

Of course we must stop IF THE TRACER IS ACTUALLY TRACING US.

But that's simply not the case. The deadlock case is where the tracer
is going through an execve, and the tracing thread is being killed.

Claiming that "user space breaks" is garbage. User space cannot care.
In fact, it's broken right now because it deadlocks, but it deadlocks
because that code waits for absolutely no good reason.

> b) Even if we added a test for our ptrace parent blocking in a ptrace
> attach of an ongoing exec, it still wouldn't help.
>
> That ptrace attach could legitimately come after the thread in
> question has stopped and notified it's parent it is stopped.

What?

The whole point is that the tracer _is_ the thing going through
execve(), which is why you get the deadlock in the first place.

You make no sense.

If the tracer is somebody else, we wouldn't be deadlocking. We'd just
be tracing.

I really don't understand your arguments against my patch. They seem
entirely nonsensical. Are we speaking past each other some way?

Linus

2020-04-09 17:40:16

by Linus Torvalds

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Thu, Apr 9, 2020 at 10:17 AM Bernd Edlinger
<[email protected]> wrote:
>
> Just for reference, I used Oleg's test case,
> and improved it a bit.

I'm sure the test-case got posted somewhere, but mind sending it to me
(or just pointing me at it?)

Linus

2020-04-09 17:47:21

by Bernd Edlinger

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1



On 4/9/20 7:37 PM, Linus Torvalds wrote:
> On Thu, Apr 9, 2020 at 10:17 AM Bernd Edlinger
> <[email protected]> wrote:
>>
>> Just for reference, I used Oleg's test case,
>> and improved it a bit.
>
> I'm sure the test-case got posted somewhere, but mind sending it to me
> (or just pointing me at it?)
>


No problem, looks like you already swallowed the hook :-)
Look at the following commit in your tree:

commit 2de4e82318c7f9d34f4b08599a612cd4cd10bf0b
Author: Bernd Edlinger <[email protected]>
Date: Fri Mar 20 21:26:19 2020 +0100

selftests/ptrace: add test cases for dead-locks

This adds test cases for ptrace deadlocks.

Additionally fixes a compile problem in get_syscall_info.c,
observed with gcc-4.8.4:

get_syscall_info.c: In function 'get_syscall_info':
get_syscall_info.c:93:3: error: 'for' loop initial declarations are only
allowed in C99 mode
for (unsigned int i = 0; i < ARRAY_SIZE(args); ++i) {
^
get_syscall_info.c:93:3: note: use option -std=c99 or -std=gnu99 to compile
your code

Signed-off-by: Bernd Edlinger <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>


Test case 1/2 is working, 2/2 is failing by deadlocking;
I think even the time-out handler does not break the deadlock,
if I remember correctly.

And sorry, I anticipated part 15/16 and 16/16 would be pulled at the
same time, so the glitch would not be visible by now.


Bernd.

> Linus
>

2020-04-09 18:40:19

by Linus Torvalds

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Thu, Apr 9, 2020 at 10:46 AM Bernd Edlinger
<[email protected]> wrote:
>
> Test case 1/2 is working, 2/2 is failing by deadlocking;
> I think even the time-out handler does not break the deadlock,
> if I remember correctly.

Ok, I get

[==========] Running 2 tests from 1 test cases.
[ RUN      ] global.vmaccess
[       OK ] global.vmaccess
[ RUN      ] global.attach
global.attach: Test terminated by timeout
[     FAIL ] global.attach
[==========] 1 / 2 tests passed.
[  FAILED  ]

but reading that test it's not doing what I thought it was doing.

I thought it was doing the ptrace from within a thread. But it's doing
a proper fork() and doing the attach from the parent, just doing the
TRACEME from a thread that exits.

I guess I need to look at what that test is actually testing, because
it wasn't what I thought.

Linus

2020-04-09 19:44:24

by Linus Torvalds

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Thu, Apr 9, 2020 at 11:36 AM Linus Torvalds
<[email protected]> wrote:
>
> I guess I need to look at what that test is actually testing, because
> it wasn't what I thought.

Ahh.

The problem is that zap_other_threads() counts all threads.

But it doesn't bother notifying already dead threads, even if it counts them.

And then it waits for the threads to go away, but didn't do anything
to make that dead thread go away.

And the test case has an already dead thread that is just waiting to
be reaped by the same person who is now waiting for it to go away.

So it just stays around.

Honestly, I'm not entirely sure this is worth worrying about, since
it's all killable anyway and only happens if you do something stupid.

I mean, you can get two threads to wait for each other more easily other ways.

Or maybe we just shouldn't count already dead threads? Yeah, they'd
share that current signal struct, but they're dead and can't do
anything about it, they can only be reaped.

But that would mean that we should also move the signal->notify_count
update to when we mark the EXIT_ZOMBIE or EXIT_DEAD in exit_state.

Linus

2020-04-09 19:59:16

by Bernd Edlinger

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/9/20 9:42 PM, Linus Torvalds wrote:
> On Thu, Apr 9, 2020 at 11:36 AM Linus Torvalds
> <[email protected]> wrote:
>>
>> I guess I need to look at what that test is actually testing, because
>> it wasn't what I thought.
>
> Ahh.
>
> The problem is that zap_other_threads() counts all threads.
>
> But it doesn't bother notifying already dead threads, even if it counts them.
>
> And then it waits for the threads to go away, but didn't do anything
> to make that dead thread go away.
>
> And the test case has an already dead thread that is just waiting to
> be reaped by the same person who is now waiting for it to go away.
>
> So it just stays around.
>
> Honestly, I'm not entirely sure this is worth worrying about, since
> it's all killable anyway and only happens if you do something stupid.
>

The use case where this may happen is with strace,
when you call strace with lots of -p <pid> arguments,
and one of them is a bomb: strace gets stuck.

So when that happens in the beginning, it is not much
work lost, but if you traced a megabyte of data to analyze
and then that happens, you are not really amused.

Also, slightly different things happen with PTRACE_O_TRACEEXIT:
then the tracer is supposed to continue the exit, and then
to wait for the thread to die. Which is twice as ugly...


Bernd.


> I mean, you can get two threads to wait for each other more easily other ways.
>
> Or maybe we just shouldn't count already dead threads? Yeah, they'd
> share that current signal struct, but they're dead and can't do
> anything about it, they can only be reaped.
>
> But that would mean that we should also move the signal->notify_count
> update to when we mark the EXIT_ZOMBIE or EXIT_DEAD in exit_state.
>
> Linus
>

2020-04-09 20:09:43

by Linus Torvalds

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Thu, Apr 9, 2020 at 12:57 PM Bernd Edlinger
<[email protected]> wrote:
>
> The use case where this may happen is with strace,
> when you call strace with lots of -p <pid> arguments,
> and one of them is a bomb: strace gets stuck.

Yeah, so from a convenience angle I do agree that it would be nicer to
just not count dead threads.

You can test that by just moving the

	/* Don't bother with already dead threads */
	if (t->exit_state)
		continue;

test in zap_other_threads() to above the

	count++;

line instead.

NOTE! That is *NOT* the correct true fix. I'm just suggesting that you
try if it fixes that particular test-case (I did not try it myself -
because .. lazy)

If Oleg agrees that we could take the approach that we can share a
signal struct with dead threads, we'd also need to change the
accounting to do that notify_count not when the signal struct is
unlinked, but when exit_state is first set.

I'm not convinced that's the right solution, but I do agree that it's
annoying how easily strace can get stuck, since one of the main uses
for strace is for debugging nasty situations.

Linus

2020-04-09 20:41:11

by Bernd Edlinger

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/9/20 10:04 PM, Linus Torvalds wrote:
> On Thu, Apr 9, 2020 at 12:57 PM Bernd Edlinger
> <[email protected]> wrote:
>>
>> The use case where this may happen is with strace,
>> when you call strace with lots of -p <pid> arguments,
>> and one of them is a bomb: strace gets stuck.
>
> Yeah, so from a convenience angle I do agree that it would be nicer to
> just not count dead threads.
>
> You can test that by just moving the
>
> 	/* Don't bother with already dead threads */
> 	if (t->exit_state)
> 		continue;
>
> test in zap_other_threads() to above the
>
> 	count++;
>
> line instead.
>
> NOTE! That is *NOT* the correct true fix. I'm just suggesting that you

Eric, I think he means you, I am too busy with other work ;-) right now.

> try if it fixes that particular test-case (I did not try it myself -
> because .. lazy)
>
> If Oleg agrees that we could take the approach that we can share a
> signal struct with dead threads, we'd also need to change the
> accounting to do that notify_count not when the signal struct is
> unlinked, but when exit_state is first set.
>
> I'm not convinced that's the right solution, but I do agree that it's
> annoying how easily strace can get stuck, since one of the main uses
> for strace is for debugging nasty situations.
>
> Linus
>

2020-04-09 21:05:43

by Eric W. Biederman

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

Linus Torvalds <[email protected]> writes:

> On Thu, Apr 9, 2020 at 12:57 PM Bernd Edlinger
> <[email protected]> wrote:
>>
>> The use case where this may happen is with strace,
>> when you call strace with lots of -p <pid> arguments,
>> and one of them is a bomb: strace gets stuck.
>
> Yeah, so from a convenience angle I do agree that it would be nicer to
> just not count dead threads.
>
> You can test that by just moving the
>
> 	/* Don't bother with already dead threads */
> 	if (t->exit_state)
> 		continue;
>
> test in zap_other_threads() to above the
>
> 	count++;
>
> line instead.

That looks like a legitimate race, and something worth addressing. It
doesn't look like t->exit_state has siglock protection so I don't think
testing it under siglock would fix that race. But something like that
certainly should.

But no. While you are doing a good job at spotting odd corner
cases that need to be fixed, this also is not the cause of the
deadlock. It is nothing that subtle.

Eric

2020-04-09 21:19:14

by Linus Torvalds

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Thu, Apr 9, 2020 at 2:03 PM Eric W. Biederman <[email protected]> wrote:
>
> But no. While you are doing a good job at spotting odd corner
> cases that need to be fixed, this also is not the cause of the
> deadlock. It is nothing that subtle.

So Eric, I'm now going to stop wasting my time on arguing with you.

Since both you and Bernd claimed to be too busy to even bother testing
that thing, I just built it and booted it.

And guess what? That thing makes your non-deadlock thing go away.

So it's _literally_ that simple.

Now, does it make the tests "pass"? No.

Because the "vmaccess" test fails because the open() now fails -
because we simply don't wait for that dead thread any more, so the
/proc/<pid>/mem thing doesn't exist.

And for the same reason that "attach" test now no longer returns
EAGAIN, it just attaches to the remaining execlp thing instead.

So I'm not just good at "spotting odd corner cases". I told you why
that bogus deadlock of yours failed - the execve was pointlessly
waiting for a dead thread that had marked itself ptraced, and nobody
was reaping it.

And it appears you were too lazy to even try it out.

Yes, that whole "notify_dead" count vs "tsk->exit_state" test is
fundamentally racy. But that race happens to be irrelevant for the
test case in question.

So until you can actually add something to the discussion, I'm done
with this thread.

Linus

2020-04-09 21:37:35

by Eric W. Biederman

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

Linus Torvalds <[email protected]> writes:

> On Thu, Apr 9, 2020 at 10:06 AM Eric W. Biederman <[email protected]> wrote:
>>
>> a) We must stop in PTRACE_EVENT_EXIT during exec or userspace *breaks*.
>>
>> Those are the defined semantics and I believe it is something
>> as common as strace that depends on them.
>
> Don't be silly.
>
> Of course we must stop IF THE TRACER IS ACTUALLY TRACING US.
>
> But that's simply not the case. The deadlock case is where the tracer
> is going through an execve, and the tracing thread is being killed.

Linus please don't be daft.

I will agree that if one thread in a process ptraces another thread
in the same process, and the tracing thread calls execve, we have
a problem. A different problem, but one worth addressing.




The deadlock case I am talking about. The deadlock case that trivially
exists in real code is:

A single threaded process (the tracer) ptrace attaches to every thread of a
multi-threaded process (the tracee).

If one of these attaches succeeds, and another thread of the tracee
process calls execve before the tracer attaches to it, then the tracer
blocks in ptrace_attach waiting for the tracee's execve to succeed
while the tracee blocks in de_thread waiting for its other threads to
exit. The threads of the tracee attempt to exit but one or more of them
are in PTRACE_EVENT_EXIT waiting for the tracer to let them continue.

The tracer of course is stalled waiting for the exec to succeed.


Let me see if I can draw a picture.




Tracer                        TraceeThreadA        TraceeThreadB
ptrace_attach(TraceeThreadA)
                                                   execve
                                                   acquires cred_guard_mutex
ptrace_attach(TraceeThreadB)
Blocks on cred_guard_mutex
                                                   de_thread
                                                   waits for other threads to exit
                              Receives SIGKILL
                              do_exit()
                              PTRACE_EVENT_EXIT
                              Waits for tracer


So we have a loop.

TraceeThreadB is waiting for TraceeThreadA to reach exit_notify.
TraceeThreadA is waiting for the tracer to allow it to continue.
The Tracer is waiting for TraceeThreadB to finish its call to exec.

Since they are all waiting for each other that loop is a deadlock.

All it takes is a tracer that uses PTRACE_EVENT_EXIT.

Does that make the deadlock that I see clear?


In your proposed lock revision you were talking about ptrace_attach
taking your new lock for write, so I don't see your proposed lock
being any different in this scenario from cred_guard_mutex. Perhaps I
missed something?

I know Oleg's test case was a little more involved, but that was to
guarantee the timing; perhaps that introduced confusion.

Eric

2020-04-09 23:55:47

by Bernd Edlinger

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1



On 4/9/20 11:17 PM, Linus Torvalds wrote:
> On Thu, Apr 9, 2020 at 2:03 PM Eric W. Biederman <[email protected]> wrote:
>>
>> But no. While you are doing a good job at spotting odd corner
>> cases that need to be fixed, this also is not the cause of the
>> deadlock. It is nothing that subtle.
>
> So Eric, I'm now going to stop wasting my time on arguing with you.
>
> Since both you and Bernd claimed to be too busy to even bother testing
> that thing, I just built it and booted it.
>
> And guess what? That thing makes your non-deadlock thing go away.
>
> So it's _literally_ that simple.
>

You know I was right from the beginning :-) :-) (-: (-:,
I said you would have to adjust the test. I only thought of the
second part, so that is where I was wrong.

Yeah Thanks. My real problem is called OpenSSL 3.0 + FIPS and it feels
like a very big pain in the ass......

But please tell nobody. That is a secret :-)


Thanks
Bernd.

> Now, does it make the tests "pass"? No.
>
> Because the "vmaccess" test fails because the open() now fails -
> because we simply don't wait for that dead thread any more, so the
> /proc/<pid>/mem thing doesn't exist.
>
> And for the same reason that "attach" test now no longer returns
> EAGAIN, it just attaches to the remaining execlp thing instead.
>
> So I'm not just good at "spotting odd corner cases". I told you why
> that bogus deadlock of yours failed - the execve was pointlessly
> waiting for a dead thread that had marked itself ptraced, and nobody
> was reaping it.
>
> And it appears you were too lazy to even try it out.
>
> Yes, that whole "notify_dead" count vs "tsk->exit_state" test is
> fundamentally racy. But that race happens to be irrelevant for the
> test case in question.
>
> So until you can actually add something to the discussion, I'm done
> with this thread.
>
> Linus
>

2020-04-10 00:48:34

by Linus Torvalds

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Thu, Apr 9, 2020 at 2:17 PM Linus Torvalds
<[email protected]> wrote:
>
> So I'm not just good at "spotting odd corner cases". I told you why
> that bogus deadlock of yours failed - the execve was pointlessly
> waiting for a dead thread that had marked itself ptraced, and nobody
> was reaping it.

Side note: this does seem to be the only special case at least in this
particular area for execve().

The thing is, for actual _live_ threads that end up waiting for
ptracers, the SIGKILL that zap_other_threads() does should punch
through everything - it should not just break out anybody waiting for
the mutex, it should also break any "wait for tracer" ptrace waiting
conditions.

But already dead threads are special. Unlike SIGKILL'ed other threads,
they can stay around almost forever.

Because they're in Zombie state, and no amount of SIGKILL will do
anything to them (and zap_other_threads() didn't even try), and
they'll stay around until they are reaped by some entirely independent
party (which just happened to be the same one that was doing the
ptrace, but doesn't really have to be)

(Of course, who knows what other special cases there might be - I'm
not saying this is the _only_ special case, but in the particular area
of 'zap other threads and wait for them', already dead threads do seem
to be special).

So the fact that "zap_threads()" counts dead threads - but cannot do
anything about them - is fundamentally different, and that's why that
particular test-case has that odd behavior.

So I think we have basically three choices:

(1) have execve() not wait for dead threads while holding the cred
mutex (that's effectively what that zap_threads() hacky patch does,
but it's not really correct because it can cause notify_count
underflows)

(2) have zap_other_threads() force-reap Zombie threads

(3) say that it's a user space bug, and if you're doing PTRACE_ATTACH
you need to make sure there are no dead threads of the target that
might be blocking an execve().

For an example of that (3) approach: making the test-case just do a
waitpid() before the PTRACE_ATTACH will unhang it, because it reaps
that thread that did the PTRACE_TRACEME.

So option (3) is basically saying "that test-case is buggy, exactly
like the two readers on a pipe".

But I continued to look at (1), but trying to deal with the fact that
"notify_count" will get decremented not just by live threads, but by
the ones that already died too.

Instead of trying to change how notify_count gets decremented, we
could do something like the attached patch: wait for it to go down to
zero, yes, but go back and re-check until you don't have to wait any
more. That should fix the underflow situation. The comment says it
all.

This patch is 100% untested. It probably compiles, that's all I'll
say. I'm not proud of it. And I may be missing some important thing
(and I'm not happy about the magical "am I not the
thread_group_leader?" special case etc).

It's worth noting that this code is the only one that cares about the
return value of zap_other_threads(), so changing the semantics to only
count non-dead threads is safe in that sense.

Whether it's safe to then share the signal structure over the rest of
execve() - even if it's only with dead threads - I didn't check.

I think Oleg might be the only person alive who understands all of our
process exit code.

Oleg? Comments?

Linus

2020-04-11 18:22:23

by Oleg Nesterov

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

Eric, Linus, et al,

by various reasons I have not been reading emails for last weeks,
I'll try to read this thread tomorrow, currently I am a bit lost.

On 04/09, Linus Torvalds wrote:
>
> (1) have execve() not wait for dead threads while holding the cred
> mutex

This is what I tried to do 3 years ago, see

[PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held
https://lore.kernel.org/lkml/[email protected]/

yes, yes, yes, the patch is not pretty.

From your another email:

> 	/* if the parent is going through an execve(), it's not listening */
> 	if (parent->signal->group_exit_task)
> 		return false;

Heh ;) see

[PATCH 2/2] ptrace: ensure PTRACE_EVENT_EXIT won't stop if the tracee is killed by exec
https://lore.kernel.org/lkml/[email protected]/

from the same thread.

But this change is more problematic.

Oleg.

2020-04-11 18:30:54

by Linus Torvalds

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Sat, Apr 11, 2020 at 11:21 AM Oleg Nesterov <[email protected]> wrote:
>
> On 04/09, Linus Torvalds wrote:
> >
> > (1) have execve() not wait for dead threads while holding the cred
> > mutex
>
> This is what I tried to do 3 years ago, see

Well, you did it differently - by moving the "wait for dead threads"
logic to after releasing the lock.

My simpler patch was lazier - just don't wait for dead threads at all,
since they are dead and not interesting.

Because even if it's Easter weekend, those threads are not coming back
to life ;)

You do say in that old patch that we can't just share the signal
state, but I wonder how true that is. Sharing it with a TASK_ZOMBIE
doesn't seem all that problematic to me. The only thing it can do is
get reaped by a later wait.

That said, I actually am starting to think that maybe execve() should
just try to reap those threads instead, and avoid the whole issue that
way. Basically my "option (2)" thing.

Sure, that's basically stealing them from the parent, but 'execve()'
really is special wrt threads, and the parent still has the execve()
thread itself. And it's not so different from SIGKILL, which also
forcibly breaks off any ptracer etc without anybody being able to say
anything about it.

Linus

2020-04-11 18:33:59

by Linus Torvalds

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Sat, Apr 11, 2020 at 11:29 AM Linus Torvalds
<[email protected]> wrote:
>
> Well, you did it differently - by moving the "wait for dead threads"
> logic to after releasing the lock.

Not that I mind that approach either - the less work we do inside that
lock, the better off we are..

Linus

2020-04-11 19:16:54

by Bernd Edlinger

Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1



On 4/11/20 8:29 PM, Linus Torvalds wrote:
> On Sat, Apr 11, 2020 at 11:21 AM Oleg Nesterov <[email protected]> wrote:
>>
>> On 04/09, Linus Torvalds wrote:
>>>
>>> (1) have execve() not wait for dead threads while holding the cred
>>> mutex
>>
>> This is what I tried to do 3 years ago, see
>
> Well, you did it differently - by moving the "wait for dead threads"
> logic to after releasing the lock.
>
> My simpler patch was lazier - just don't wait for dead threads at all,
> since they are dead and not interesting.

But won't the dead thread's lifetime overlap the new thread's lifetime
from the tracer's POV?


Bernd.

>
> Because even if it's Easter weekend, those threads are not coming back
> to life ;)
>
> You do say in that old patch that we can't just share the signal
> state, but I wonder how true that is. Sharing it with a TASK_ZOMBIE
> doesn't seem all that problematic to me. The only thing that can do is
> getting reaped by a later wait.
>
> That said, I actually am starting to think that maybe execve() should
> just try to reap those threads instead, and avoid the whole issue that
> way. Basically my "option (2)" thing.
>
> Sure, that's basically stealing them from the parent, but 'execve()'
> really is special wrt threads, and the parent still has the execve()
> thread itself. And it's not so different from SIGKILL, which also
> forcibly breaks off any ptracer etc without anybody being able to say
> anything about it.
>
> Linus
>

2020-04-11 20:12:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Sat, Apr 11, 2020 at 12:15 PM Bernd Edlinger
<[email protected]> wrote:
>
> But won't the dead thread's lifetime overlap the new thread's lifetime
> from the tracer's POV?

What new thread?

execve() doesn't create any new thread.

But yes, an external tracer could see the (old) thread that did
execve() do new system calls before it sees the (other old) thread
that was a zombie.

But that is already something that can happen, simply because the
events aren't ordered. The whole issue is that the zombie thread
already died, but the tracer just didn't bother to read that state
change.

So it's not that the dead thread somehow _dies_ after the execve(). It
already died.

It's just that whoever is to reap it (or traces it) just hasn't cared
to read the status of that thing yet.

Linus

2020-04-11 21:20:10

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/11/20 10:07 PM, Linus Torvalds wrote:
> On Sat, Apr 11, 2020 at 12:15 PM Bernd Edlinger
> <[email protected]> wrote:
>>
>> But won't the dead thread's lifetime overlap the new thread's lifetime
>> from the tracer's POV?
>
> What new thread?
>
> execve() doesn't create any new thread.
>
> But yes, an external tracer could see the (old) thread that did
> execve() do new system calls before it sees the (other old) thread
> that was a zombie.
>

That is an API change. Previously strace could rely on getting a
callback at the end of the execve, with all previous threads
de-zombified and waited for.

Then there is a execve done event.

And then the old thread continues to run but executing the new program.

I'd bet the strace test suite has tests for that order of events,
or at least it should.


> But that is already something that can happen, simply because the
> events aren't ordered. The whole issue is that the zombie thread
> already died, but the tracer just didn't bother to read that state
> change.

What causes the deadlock is that de_thread waits for the tracer to
wait on the threads. If that changes, something will break in user
space. Of course you could say, I did not say "Simon says".


Bernd.

>
> So it's not that the dead thread somehow _dies_ after the execve(). It
> already died.
>
> It's just that whoever is to reap it (or traces it) just hasn't cared
> to read the status of that thing yet.
>
> Linus
>

2020-04-13 08:37:37

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 04/11, Linus Torvalds wrote:
>
> On Sat, Apr 11, 2020 at 11:21 AM Oleg Nesterov <[email protected]> wrote:
> >
> > On 04/09, Linus Torvalds wrote:
> > >
> > > (1) have execve() not wait for dead threads while holding the cred
> > > mutex
> >
> > This is what I tried to do 3 years ago, see
>
> Well, you did it differently - by moving the "wait for dead threads"
> logic to after releasing the lock.

Yes, please see below.

> My simpler patch was lazier

To be honest, I don't understand it... OK, suppose that the main thread
M execs and zap_other_threads() finds a single (and alive) sub-thread T,
sig->notify_count = 1.

If T is traced, then ->notify_count won't be decremented until the tracer
reaps this task, so we have the same problem.

This is fixable, say, by uglifying exit_notify() like my patch does,
but:

> - just don't wait for dead threads at all,
> since they are dead and not interesting.

Well, I am not sure. Just for example, seccomp(SECCOMP_FILTER_FLAG_TSYNC)
can fail after mt-exec because seccomp_can_sync_threads() finds a zombie
thread. Sure, this too can be fixed, but I think there should be no
other threads left after exec.

And:

> You do say in that old patch that we can't just share the signal
> state, but I wonder how true that is.

We can share sighand_struct with TASK_ZOMBIEs. The problem is that
we can not unshare ->sighand until they go away: the exec'ing thread
and the zombies must use the same sighand->siglock to serialize access
to ->thread_head/etc.

OK, we probably can if we complicate unshare_sighand(): we will need
to take tasklist_lock/oldsighand->siglock unconditionally to check
oldsighand->count > sig->nr_thread, then do

	for_each_thread(current, t) {
		t->sighand = newsighand;
		__cleanup_sighand(oldsighand);
	}

but see above, I don't think this makes any sense.

Oleg