2019-07-22 17:27:18

by Christian Brauner

[permalink] [raw]
Subject: [GIT PULL] pidfd fixes

Hi Linus,

This contains a fix for pidfd polling. It ensures that the task's exit
state is visible to all waiters:

The following changes since commit 5f9e832c137075045d15cd6899ab0505cfb2ca4b:

Linus 5.3-rc1 (2019-07-21 14:05:38 -0700)

are available in the Git repository at:

[email protected]:pub/scm/linux/kernel/git/brauner/linux tags/for-linus-20190722

for you to fetch changes up to b191d6491be67cef2b3fa83015561caca1394ab9:

pidfd: fix a poll race when setting exit_state (2019-07-22 16:02:03 +0200)

/* Summary */
The pidfd polling code suffered from a race condition. A waiter could be
notified via do_notify_pidfd() without the task's exit state being set and
thus not visible to the waiter. This would cause the waiter to be blocked
indefinitely. The following schematic illustrates how this could happen:

CPU 0 CPU 1
------------------------------------------------
exit_notify
do_notify_parent
do_notify_pidfd
pidfd_poll
if (tsk->exit_state)
tsk->exit_state = EXIT_DEAD

This is fixed by ensuring that the task's exit state is set before calling
into do_notify_pidfd().

Please consider pulling from the signed for-linus-20190722 tag.

Thanks!
Christian

----------------------------------------------------------------
for-linus-20190722

----------------------------------------------------------------
Suren Baghdasaryan (1):
pidfd: fix a poll race when setting exit_state

kernel/exit.c | 1 +
1 file changed, 1 insertion(+)


2019-07-22 17:42:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] pidfd fixes

On Mon, Jul 22, 2019 at 7:26 AM Christian Brauner <[email protected]> wrote:
>
> This contains a fix for pidfd polling. It ensures that the task's exit
> state is visible to all waiters:

Hmm.

I've pulled this, but the exit_state thing has been very fragile
before, and I'm not entirely happy with how this just changes where it
is set. I guess the movement here is all inside the tasklist_lock, so
it's not that big of a deal, but still..

I would *really* like Oleg to take a look.

Also, and the primary reason I write this email is that this basically
makes the "EXIT_ZOMBIE / EXIT_DEAD" state handling look all kinds of
crazy. You set it to EXIT_ZOMBIE potentially _twice_. Whaa?

So if we set EXIT_ZOMBIE early, then I think we should change the
EXIT_DEAD case too. IOW, do something like this on top:

--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -734,9 +734,10 @@ static void exit_notify(struct task_struct
*tsk, int group_dead)
autoreap = true;
}

- tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
- if (tsk->exit_state == EXIT_DEAD)
+ if (autoreap) {
+ tsk->exit_state = EXIT_DEAD;
list_add(&tsk->ptrace_entry, &dead);
+ }

/* mt-exec, de_thread() is waiting for group leader */
if (unlikely(tsk->signal->notify_count < 0))

where now the logic becomes "ok, we turned into a zombie above, and if
we autoreap this thread then we turn the zombie into a fully dead
thread".

Because currently we end up having "first turn it into a zombie", then
"set it to zombie or dead depending on autoreap" and then "if we
turned it into dead, move it to the dead list".

That just feels confused to me. Comments?

Linus

2019-07-22 21:14:05

by Christian Brauner

[permalink] [raw]
Subject: Re: [GIT PULL] pidfd fixes

On Mon, Jul 22, 2019 at 09:27:08AM -0700, Linus Torvalds wrote:
> On Mon, Jul 22, 2019 at 7:26 AM Christian Brauner <[email protected]> wrote:
> >
> > This contains a fix for pidfd polling. It ensures that the task's exit
> > state is visible to all waiters:
>
> Hmm.
>
> I've pulled this, but the exit_state thing has been very fragile
> before, and I'm not entirely happy with how this just changes where it
> is set. I guess the movement here is all inside the tasklist_lock, so
> it's not that big of a deal, but still..
>
> I would *really* like Oleg to take a look.

Oh, sorry. Oleg did take a look. See:

https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/

>
> Also, and the primary reason I write this email is that this basically
> makes the "EXIT_ZOMBIE / EXIT_DEAD" state handling look all kinds of
> crazy. You set it to EXIT_ZOMBIE potentially _twice_. Whaa?
>
> So if we set EXIT_ZOMBIE early, then I think we should change the
> EXIT_DEAD case too. IOW, do something like this on top:
>
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -734,9 +734,10 @@ static void exit_notify(struct task_struct
> *tsk, int group_dead)
> autoreap = true;
> }
>
> - tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
> - if (tsk->exit_state == EXIT_DEAD)
> + if (autoreap) {
> + tsk->exit_state = EXIT_DEAD;
> list_add(&tsk->ptrace_entry, &dead);
> + }
>
> /* mt-exec, de_thread() is waiting for group leader */
> if (unlikely(tsk->signal->notify_count < 0))
>
> where now the logic becomes "ok, we turned into a zombie above, and if
> we autoreap this thread then we turn the zombie into a fully dead
> thread".
>
> Because currently we end up having "first turn it into a zombie", then
> "set it to zombie or dead depending on autoreap" and then "if we
> turned it into dead, move it to the dead list".
>
> That just feels confused to me. Comments?

Agreed. But that codepath is so core-kernel that I really felt more
comfortable just doing the absolut minimal thing so that when things
bite us we see it right away.

There's no harm in sending a cleanup for this later I think, when we
haven't hit any weirdness with the current change. Does that sound ok?

Christian

2019-07-22 21:15:24

by pr-tracker-bot

[permalink] [raw]
Subject: Re: [GIT PULL] pidfd fixes

The pull request you sent on Mon, 22 Jul 2019 16:22:41 +0200:

> [email protected]:pub/scm/linux/kernel/git/brauner/linux tags/for-linus-20190722

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/44b912cd0b55777796c5ae8ae857bd1d5ff83ed5

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker

2019-07-23 20:04:02

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [GIT PULL] pidfd fixes

On 07/22, Linus Torvalds wrote:
>
> So if we set EXIT_ZOMBIE early, then I think we should change the
> EXIT_DEAD case too. IOW, do something like this on top:
>
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -734,9 +734,10 @@ static void exit_notify(struct task_struct
> *tsk, int group_dead)
> autoreap = true;
> }
>
> - tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
> - if (tsk->exit_state == EXIT_DEAD)
> + if (autoreap) {
> + tsk->exit_state = EXIT_DEAD;
> list_add(&tsk->ptrace_entry, &dead);
> + }

Yes, this needs cleanups. Actually I was going to suggest another change
below, this way do_notify_pidfd() is only called when it is really needed.
But then I decided a trivial one-liner makes more sense for the start.

I'll try to think. Perhaps we should also change do_notify_parent() to set
p->exit_state, at least if autoreap. Then the early exit_state = EXIT_ZOMBIE
won't look so confusing and we can do more (minor) cleanups.

Oleg.

--- x/kernel/exit.c
+++ x/kernel/exit.c
@@ -182,6 +182,13 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
put_task_struct(tsk);
}

+static void do_notify_pidfd(struct task_struct *task)
+{
+ struct pid *pid;
+
+ pid = task_pid(task);
+ wake_up_all(&pid->wait_pidfd);
+}

void release_task(struct task_struct *p)
{
@@ -218,6 +225,8 @@ void release_task(struct task_struct *p)
zap_leader = do_notify_parent(leader, leader->exit_signal);
if (zap_leader)
leader->exit_state = EXIT_DEAD;
+
+ do_notify_pidfd(leader);
}

write_unlock_irq(&tasklist_lock);
@@ -710,7 +719,7 @@ static void forget_original_parent(struct task_struct *father,
*/
static void exit_notify(struct task_struct *tsk, int group_dead)
{
- bool autoreap;
+ bool autoreap, xxx;
struct task_struct *p, *n;
LIST_HEAD(dead);

@@ -720,23 +729,22 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
if (group_dead)
kill_orphaned_pgrp(tsk->group_leader, NULL);

- if (unlikely(tsk->ptrace)) {
- int sig = thread_group_leader(tsk) &&
- thread_group_empty(tsk) &&
- !ptrace_reparented(tsk) ?
- tsk->exit_signal : SIGCHLD;
+ autoreap = true;
+ xxx = thread_group_leader(tsk) && thread_group_empty(tsk);
+
+ if (xxx || unlikely(tsk->ptrace)) {
+ int sig = xxx && !ptrace_reparented(tsk)
+ ? tsk->exit_signal : SIGCHLD;
autoreap = do_notify_parent(tsk, sig);
- } else if (thread_group_leader(tsk)) {
- autoreap = thread_group_empty(tsk) &&
- do_notify_parent(tsk, tsk->exit_signal);
- } else {
- autoreap = true;
}

tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
if (tsk->exit_state == EXIT_DEAD)
list_add(&tsk->ptrace_entry, &dead);

+ if (xxx)
+ do_notify_pidfd(tsk);
+
/* mt-exec, de_thread() is waiting for group leader */
if (unlikely(tsk->signal->notify_count < 0))
wake_up_process(tsk->signal->group_exit_task);
--- x/kernel/signal.c
+++ x/kernel/signal.c
@@ -1881,14 +1881,6 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
return ret;
}

-static void do_notify_pidfd(struct task_struct *task)
-{
- struct pid *pid;
-
- pid = task_pid(task);
- wake_up_all(&pid->wait_pidfd);
-}
-
/*
* Let a parent know about the death of a child.
* For a stopped/continued status change, use do_notify_parent_cldstop instead.
@@ -1912,9 +1904,6 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
BUG_ON(!tsk->ptrace &&
(tsk->group_leader != tsk || !thread_group_empty(tsk)));

- /* Wake up all pidfd waiters */
- do_notify_pidfd(tsk);
-
if (sig != SIGCHLD) {
/*
* This is only possible if parent == real_parent.

2019-07-23 20:05:16

by Christian Brauner

[permalink] [raw]
Subject: Re: [GIT PULL] pidfd fixes

On Tue, Jul 23, 2019 at 12:12:49PM +0200, Oleg Nesterov wrote:
> On 07/22, Linus Torvalds wrote:
> >
> > So if we set EXIT_ZOMBIE early, then I think we should change the
> > EXIT_DEAD case too. IOW, do something like this on top:
> >
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -734,9 +734,10 @@ static void exit_notify(struct task_struct
> > *tsk, int group_dead)
> > autoreap = true;
> > }
> >
> > - tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
> > - if (tsk->exit_state == EXIT_DEAD)
> > + if (autoreap) {
> > + tsk->exit_state = EXIT_DEAD;
> > list_add(&tsk->ptrace_entry, &dead);
> > + }
>
> Yes, this needs cleanups. Actually I was going to suggest another change
> below, this way do_notify_pidfd() is only called when it is really needed.
> But then I decided a trivial one-liner makes more sense for the start.

Yeah, that was my thinking exactly.

>
> I'll try to think. Perhaps we should also change do_notify_parent() to set
> p->exit_state, at least if autoreap. Then the early exit_state = EXIT_ZOMBIE
> won't look so confusing and we can do more (minor) cleanups.

You mind sending that as a proper patch?
I also have a few more cleanups for other parts I intend to send later
this week. I'd pick this one up too.

>
> Oleg.
>
> --- x/kernel/exit.c
> +++ x/kernel/exit.c
> @@ -182,6 +182,13 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
> put_task_struct(tsk);
> }
>
> +static void do_notify_pidfd(struct task_struct *task)
> +{
> + struct pid *pid;
> +
> + pid = task_pid(task);
> + wake_up_all(&pid->wait_pidfd);
> +}
>
> void release_task(struct task_struct *p)
> {
> @@ -218,6 +225,8 @@ void release_task(struct task_struct *p)
> zap_leader = do_notify_parent(leader, leader->exit_signal);
> if (zap_leader)
> leader->exit_state = EXIT_DEAD;
> +
> + do_notify_pidfd(leader);
> }
>
> write_unlock_irq(&tasklist_lock);
> @@ -710,7 +719,7 @@ static void forget_original_parent(struct task_struct *father,
> */
> static void exit_notify(struct task_struct *tsk, int group_dead)
> {
> - bool autoreap;
> + bool autoreap, xxx;

In case you don't mind the length of it how about:

bool autoreap, leading_empty_thread_group;

It's not the prettiest names but having rather descriptive var names in
these codepaths seems like a good idea to me.
It also reads very naturally further below:

if (leading_empty_thread_group || unlikely(tsk->ptrace)) {
int sig = leading_empty_thread_group && !ptrace_reparented(tsk)
? tsk->exit_signal : SIGCHLD;

> struct task_struct *p, *n;
> LIST_HEAD(dead);
>
> @@ -720,23 +729,22 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> if (group_dead)
> kill_orphaned_pgrp(tsk->group_leader, NULL);
>
> - if (unlikely(tsk->ptrace)) {
> - int sig = thread_group_leader(tsk) &&
> - thread_group_empty(tsk) &&
> - !ptrace_reparented(tsk) ?
> - tsk->exit_signal : SIGCHLD;
> + autoreap = true;
> + xxx = thread_group_leader(tsk) && thread_group_empty(tsk);
> +
> + if (xxx || unlikely(tsk->ptrace)) {
> + int sig = xxx && !ptrace_reparented(tsk)
> + ? tsk->exit_signal : SIGCHLD;
> autoreap = do_notify_parent(tsk, sig);
> - } else if (thread_group_leader(tsk)) {
> - autoreap = thread_group_empty(tsk) &&
> - do_notify_parent(tsk, tsk->exit_signal);
> - } else {
> - autoreap = true;
> }
>
> tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
> if (tsk->exit_state == EXIT_DEAD)
> list_add(&tsk->ptrace_entry, &dead);
>
> + if (xxx)
> + do_notify_pidfd(tsk);
> +
> /* mt-exec, de_thread() is waiting for group leader */
> if (unlikely(tsk->signal->notify_count < 0))
> wake_up_process(tsk->signal->group_exit_task);
> --- x/kernel/signal.c
> +++ x/kernel/signal.c
> @@ -1881,14 +1881,6 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
> return ret;
> }
>
> -static void do_notify_pidfd(struct task_struct *task)
> -{
> - struct pid *pid;
> -
> - pid = task_pid(task);
> - wake_up_all(&pid->wait_pidfd);
> -}
> -
> /*
> * Let a parent know about the death of a child.
> * For a stopped/continued status change, use do_notify_parent_cldstop instead.
> @@ -1912,9 +1904,6 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
> BUG_ON(!tsk->ptrace &&
> (tsk->group_leader != tsk || !thread_group_empty(tsk)));
>
> - /* Wake up all pidfd waiters */
> - do_notify_pidfd(tsk);
> -
> if (sig != SIGCHLD) {
> /*
> * This is only possible if parent == real_parent.
>