From: Suren Baghdasaryan <[email protected]>
There is a race between reading task->exit_state in pidfd_poll and writing
it after do_notify_parent calls do_notify_pidfd. Expected sequence of
events is:
CPU 0 CPU 1
------------------------------------------------
exit_notify
do_notify_parent
do_notify_pidfd
tsk->exit_state = EXIT_DEAD
pidfd_poll
if (tsk->exit_state)
However nothing prevents the following sequence:
CPU 0 CPU 1
------------------------------------------------
exit_notify
do_notify_parent
do_notify_pidfd
pidfd_poll
if (tsk->exit_state)
tsk->exit_state = EXIT_DEAD
This causes a polling task to wait forever, since poll blocks because
exit_state is 0 and the waiting task is not notified again. A stress
test continuously doing pidfd poll and process exits uncovered this bug,
and the below patch fixes it.
To fix this, we set tsk->exit_state before calling do_notify_pidfd.
Cc: [email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/exit.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index a75b6a7f458a..740ceacb4b76 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -720,6 +720,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
if (group_dead)
kill_orphaned_pgrp(tsk->group_leader, NULL);
+ tsk->exit_state = EXIT_ZOMBIE;
if (unlikely(tsk->ptrace)) {
int sig = thread_group_leader(tsk) &&
thread_group_empty(tsk) &&
@@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
ptrace_unlink(p);
/* If parent wants a zombie, don't release it now */
- state = EXIT_ZOMBIE;
+ p->exit_state = EXIT_ZOMBIE;
if (do_notify_parent(p, p->exit_signal))
- state = EXIT_DEAD;
- p->exit_state = state;
+ p->exit_state = EXIT_DEAD;
+
+ state = p->exit_state;
write_unlock_irq(&tasklist_lock);
}
if (state == EXIT_DEAD)
--
2.22.0.657.g960e92d24f-goog
On Wed, Jul 17, 2019 at 01:21:00PM -0400, Joel Fernandes wrote:
> From: Suren Baghdasaryan <[email protected]>
>
> There is a race between reading task->exit_state in pidfd_poll and writing
> it after do_notify_parent calls do_notify_pidfd. Expected sequence of
> events is:
>
> CPU 0 CPU 1
> ------------------------------------------------
> exit_notify
> do_notify_parent
> do_notify_pidfd
> tsk->exit_state = EXIT_DEAD
> pidfd_poll
> if (tsk->exit_state)
>
> However nothing prevents the following sequence:
>
> CPU 0 CPU 1
> ------------------------------------------------
> exit_notify
> do_notify_parent
> do_notify_pidfd
> pidfd_poll
> if (tsk->exit_state)
> tsk->exit_state = EXIT_DEAD
>
> This causes a polling task to wait forever, since poll blocks because
> exit_state is 0 and the waiting task is not notified again. A stress
> test continuously doing pidfd poll and process exits uncovered this bug,
> and the below patch fixes it.
>
> To fix this, we set tsk->exit_state before calling do_notify_pidfd.
>
> Cc: [email protected]
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
That means in such a situation other users will see EXIT_ZOMBIE where
they didn't see that before until after the parent failed to get
notified.
That's a rather subtle internal change. I was worried about
__ptrace_detach() since it explicitly checks for EXIT_ZOMBIE but it
seems to me that this is fine since we hold write_lock_irq(&tasklist_lock);
at the point when we do set p->exit_signal.
Acked-by: Christian Brauner <[email protected]>
Once Oleg confirms that I'm right not to worty I'll pick this up.
Thanks!
Christian
>
> ---
> kernel/exit.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index a75b6a7f458a..740ceacb4b76 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -720,6 +720,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> if (group_dead)
> kill_orphaned_pgrp(tsk->group_leader, NULL);
>
> + tsk->exit_state = EXIT_ZOMBIE;
> if (unlikely(tsk->ptrace)) {
> int sig = thread_group_leader(tsk) &&
> thread_group_empty(tsk) &&
> @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
> ptrace_unlink(p);
>
> /* If parent wants a zombie, don't release it now */
> - state = EXIT_ZOMBIE;
> + p->exit_state = EXIT_ZOMBIE;
> if (do_notify_parent(p, p->exit_signal))
> - state = EXIT_DEAD;
> - p->exit_state = state;
> + p->exit_state = EXIT_DEAD;
> +
> + state = p->exit_state;
> write_unlock_irq(&tasklist_lock);
> }
> if (state == EXIT_DEAD)
> --
> 2.22.0.657.g960e92d24f-goog
>
On Wed, Jul 17, 2019 at 10:56 AM Christian Brauner <[email protected]> wrote:
>
> On Wed, Jul 17, 2019 at 01:21:00PM -0400, Joel Fernandes wrote:
> > From: Suren Baghdasaryan <[email protected]>
> >
> > There is a race between reading task->exit_state in pidfd_poll and writing
> > it after do_notify_parent calls do_notify_pidfd. Expected sequence of
> > events is:
> >
> > CPU 0 CPU 1
> > ------------------------------------------------
> > exit_notify
> > do_notify_parent
> > do_notify_pidfd
> > tsk->exit_state = EXIT_DEAD
> > pidfd_poll
> > if (tsk->exit_state)
> >
> > However nothing prevents the following sequence:
> >
> > CPU 0 CPU 1
> > ------------------------------------------------
> > exit_notify
> > do_notify_parent
> > do_notify_pidfd
> > pidfd_poll
> > if (tsk->exit_state)
> > tsk->exit_state = EXIT_DEAD
> >
> > This causes a polling task to wait forever, since poll blocks because
> > exit_state is 0 and the waiting task is not notified again. A stress
> > test continuously doing pidfd poll and process exits uncovered this bug,
> > and the below patch fixes it.
> >
> > To fix this, we set tsk->exit_state before calling do_notify_pidfd.
> >
> > Cc: [email protected]
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > Signed-off-by: Joel Fernandes (Google) <[email protected]>
>
> That means in such a situation other users will see EXIT_ZOMBIE where
> they didn't see that before until after the parent failed to get
> notified.
I'm a little nervous about that myself even though in my stress
testing this worked fine. I think the safest change would be to move
do_notify_pidfd() out of do_notify_parent() and call it after
tsk->exit_state is finalized. The downside of that is that there are 4
places we call do_notify_parent(), so instead of calling
do_notify_pidfd() one time from do_notify_parent() we will be calling
it 4 times now.
Also my original patch had memory barriers to ensure correct ordering
of tsk->exit_state writes before reads. In this final version Joel
removed them, so I suppose he found out they are not needed. Joel,
please clarify.
Thanks!
> That's a rather subtle internal change. I was worried about
> __ptrace_detach() since it explicitly checks for EXIT_ZOMBIE but it
> seems to me that this is fine since we hold write_lock_irq(&tasklist_lock);
> at the point when we do set p->exit_signal.
>
> Acked-by: Christian Brauner <[email protected]>
>
> Once Oleg confirms that I'm right not to worty I'll pick this up.
>
> Thanks!
> Christian
>
> >
> > ---
> > kernel/exit.c | 8 +++++---
> > 1 file changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index a75b6a7f458a..740ceacb4b76 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -720,6 +720,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > if (group_dead)
> > kill_orphaned_pgrp(tsk->group_leader, NULL);
> >
> > + tsk->exit_state = EXIT_ZOMBIE;
> > if (unlikely(tsk->ptrace)) {
> > int sig = thread_group_leader(tsk) &&
> > thread_group_empty(tsk) &&
> > @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
> > ptrace_unlink(p);
> >
> > /* If parent wants a zombie, don't release it now */
> > - state = EXIT_ZOMBIE;
> > + p->exit_state = EXIT_ZOMBIE;
> > if (do_notify_parent(p, p->exit_signal))
> > - state = EXIT_DEAD;
> > - p->exit_state = state;
> > + p->exit_state = EXIT_DEAD;
> > +
> > + state = p->exit_state;
> > write_unlock_irq(&tasklist_lock);
> > }
> > if (state == EXIT_DEAD)
> > --
> > 2.22.0.657.g960e92d24f-goog
> >
On Wed, Jul 17, 2019 at 11:09:59AM -0700, Suren Baghdasaryan wrote:
> On Wed, Jul 17, 2019 at 10:56 AM Christian Brauner <[email protected]> wrote:
> >
> > On Wed, Jul 17, 2019 at 01:21:00PM -0400, Joel Fernandes wrote:
> > > From: Suren Baghdasaryan <[email protected]>
> > >
> > > There is a race between reading task->exit_state in pidfd_poll and writing
> > > it after do_notify_parent calls do_notify_pidfd. Expected sequence of
> > > events is:
> > >
> > > CPU 0 CPU 1
> > > ------------------------------------------------
> > > exit_notify
> > > do_notify_parent
> > > do_notify_pidfd
> > > tsk->exit_state = EXIT_DEAD
> > > pidfd_poll
> > > if (tsk->exit_state)
> > >
> > > However nothing prevents the following sequence:
> > >
> > > CPU 0 CPU 1
> > > ------------------------------------------------
> > > exit_notify
> > > do_notify_parent
> > > do_notify_pidfd
> > > pidfd_poll
> > > if (tsk->exit_state)
> > > tsk->exit_state = EXIT_DEAD
> > >
> > > This causes a polling task to wait forever, since poll blocks because
> > > exit_state is 0 and the waiting task is not notified again. A stress
> > > test continuously doing pidfd poll and process exits uncovered this bug,
> > > and the below patch fixes it.
> > >
> > > To fix this, we set tsk->exit_state before calling do_notify_pidfd.
> > >
> > > Cc: [email protected]
> > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> >
> > That means in such a situation other users will see EXIT_ZOMBIE where
> > they didn't see that before until after the parent failed to get
> > notified.
>
> I'm a little nervous about that myself even though in my stress
> testing this worked fine. I think the safest change would be to move
> do_notify_pidfd() out of do_notify_parent() and call it after
> tsk->exit_state is finalized. The downside of that is that there are 4
My initial approach to pidfd polling did it this way, and I remember there
was a break in semantics where this does not work well. We want the
notification to happen in do_notify_parent() so that it is in sync with the
wait APIs..
I don't see a risk with this patch though. But let us see what Oleg's eyes
find.
> places we call do_notify_parent(), so instead of calling
> do_notify_pidfd() one time from do_notify_parent() we will be calling
> it 4 times now.
>
> Also my original patch had memory barriers to ensure correct ordering
> of tsk->exit_state writes before reads. In this final version Joel
> removed them, so I suppose he found out they are not needed. Joel,
> please clarify.
The barriers were initially add by me to your patch, but then I felt it may
not be needed so I removed them before sending the patch. My initial concern
was something like the following:
CPU 0 CPU 1
------------------------------------------------
store tsk->exit_state = 1
/* smp_wmb() ? */
do_notify_parent
wake up
poll_wait()
/* smp_rmb(); ? */
read tsk->exit_state = 0
block...
I was initially concerned if tsk->exit_state write would be missed by the
polling thread and we would block forever (similar to this bug).
I don't think this is possible anymore since wakeup implies release-barrier
and waiting implies acquire barrier AFAIU. I am still not fully sure though,
so yeah if you guys think it is an issue, let us add the memory barriers. As
such I know memory barrier additions to the kernel requires justification,
otherwise Linus calls it "Voodoo programming". So let us convince ourself
first if memory barriers are needed before adding them anyway.
thanks,
- Joel
> Thanks!
>
> > That's a rather subtle internal change. I was worried about
> > __ptrace_detach() since it explicitly checks for EXIT_ZOMBIE but it
> > seems to me that this is fine since we hold write_lock_irq(&tasklist_lock);
> > at the point when we do set p->exit_signal.
> >
> > Acked-by: Christian Brauner <[email protected]>
> >
> > Once Oleg confirms that I'm right not to worty I'll pick this up.
> >
> > Thanks!
> > Christian
> >
> > >
> > > ---
> > > kernel/exit.c | 8 +++++---
> > > 1 file changed, 5 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/kernel/exit.c b/kernel/exit.c
> > > index a75b6a7f458a..740ceacb4b76 100644
> > > --- a/kernel/exit.c
> > > +++ b/kernel/exit.c
> > > @@ -720,6 +720,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > > if (group_dead)
> > > kill_orphaned_pgrp(tsk->group_leader, NULL);
> > >
> > > + tsk->exit_state = EXIT_ZOMBIE;
> > > if (unlikely(tsk->ptrace)) {
> > > int sig = thread_group_leader(tsk) &&
> > > thread_group_empty(tsk) &&
> > > @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
> > > ptrace_unlink(p);
> > >
> > > /* If parent wants a zombie, don't release it now */
> > > - state = EXIT_ZOMBIE;
> > > + p->exit_state = EXIT_ZOMBIE;
> > > if (do_notify_parent(p, p->exit_signal))
> > > - state = EXIT_DEAD;
> > > - p->exit_state = state;
> > > + p->exit_state = EXIT_DEAD;
> > > +
> > > + state = p->exit_state;
> > > write_unlock_irq(&tasklist_lock);
> > > }
> > > if (state == EXIT_DEAD)
> > > --
> > > 2.22.0.657.g960e92d24f-goog
> > >
On Wed, Jul 17, 2019 at 07:55:57PM +0200, Christian Brauner wrote:
> On Wed, Jul 17, 2019 at 01:21:00PM -0400, Joel Fernandes wrote:
> > From: Suren Baghdasaryan <[email protected]>
> >
> > There is a race between reading task->exit_state in pidfd_poll and writing
> > it after do_notify_parent calls do_notify_pidfd. Expected sequence of
> > events is:
> >
> > CPU 0 CPU 1
> > ------------------------------------------------
> > exit_notify
> > do_notify_parent
> > do_notify_pidfd
> > tsk->exit_state = EXIT_DEAD
> > pidfd_poll
> > if (tsk->exit_state)
> >
> > However nothing prevents the following sequence:
> >
> > CPU 0 CPU 1
> > ------------------------------------------------
> > exit_notify
> > do_notify_parent
> > do_notify_pidfd
> > pidfd_poll
> > if (tsk->exit_state)
> > tsk->exit_state = EXIT_DEAD
> >
> > This causes a polling task to wait forever, since poll blocks because
> > exit_state is 0 and the waiting task is not notified again. A stress
> > test continuously doing pidfd poll and process exits uncovered this bug,
> > and the below patch fixes it.
> >
> > To fix this, we set tsk->exit_state before calling do_notify_pidfd.
> >
> > Cc: [email protected]
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > Signed-off-by: Joel Fernandes (Google) <[email protected]>
>
> That means in such a situation other users will see EXIT_ZOMBIE where
> they didn't see that before until after the parent failed to get
> notified.
>
> That's a rather subtle internal change. I was worried about
> __ptrace_detach() since it explicitly checks for EXIT_ZOMBIE but it
> seems to me that this is fine since we hold write_lock_irq(&tasklist_lock);
> at the point when we do set p->exit_signal.
Right.
> Acked-by: Christian Brauner <[email protected]>
Thanks.
> Once Oleg confirms that I'm right not to worty I'll pick this up.
Ok.
On Wed, Jul 17, 2019 at 04:47:58PM -0400, Joel Fernandes wrote:
> On Wed, Jul 17, 2019 at 11:09:59AM -0700, Suren Baghdasaryan wrote:
> > On Wed, Jul 17, 2019 at 10:56 AM Christian Brauner <[email protected]> wrote:
> > >
> > > On Wed, Jul 17, 2019 at 01:21:00PM -0400, Joel Fernandes wrote:
> > > > From: Suren Baghdasaryan <[email protected]>
> > > >
> > > > There is a race between reading task->exit_state in pidfd_poll and writing
> > > > it after do_notify_parent calls do_notify_pidfd. Expected sequence of
> > > > events is:
> > > >
> > > > CPU 0 CPU 1
> > > > ------------------------------------------------
> > > > exit_notify
> > > > do_notify_parent
> > > > do_notify_pidfd
> > > > tsk->exit_state = EXIT_DEAD
> > > > pidfd_poll
> > > > if (tsk->exit_state)
> > > >
> > > > However nothing prevents the following sequence:
> > > >
> > > > CPU 0 CPU 1
> > > > ------------------------------------------------
> > > > exit_notify
> > > > do_notify_parent
> > > > do_notify_pidfd
> > > > pidfd_poll
> > > > if (tsk->exit_state)
> > > > tsk->exit_state = EXIT_DEAD
> > > >
> > > > This causes a polling task to wait forever, since poll blocks because
> > > > exit_state is 0 and the waiting task is not notified again. A stress
> > > > test continuously doing pidfd poll and process exits uncovered this bug,
> > > > and the below patch fixes it.
> > > >
> > > > To fix this, we set tsk->exit_state before calling do_notify_pidfd.
> > > >
> > > > Cc: [email protected]
> > > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > >
> > > That means in such a situation other users will see EXIT_ZOMBIE where
> > > they didn't see that before until after the parent failed to get
> > > notified.
> >
> > I'm a little nervous about that myself even though in my stress
> > testing this worked fine. I think the safest change would be to move
> > do_notify_pidfd() out of do_notify_parent() and call it after
> > tsk->exit_state is finalized. The downside of that is that there are 4
>
> My initial approach to pidfd polling did it this way, and I remember there
> was a break in semantics where this does not work well. We want the
> notification to happen in do_notify_parent() so that it is in sync with the
> wait APIs..
>
> I don't see a risk with this patch though. But let us see what Oleg's eyes
> find.
I've been going through the various codepaths and that change should be
fine. The places I looked at that worried me were release_task(),
reparent_leader(), wait_consider_task() and their callers.
But all of these either take read_lock(&tasklist_lock) or
write_lock_irq(&tasklist_lock) themselves or are called with them held,
same for ptrace_attach() and ptrace_detach(). And the whole sequence
that switches to autoreaping when the parent ingores SIGCHLD in
do_notify_parent() and wait_task_zombie() is under
write_lock_irq(&tasklist_lock) as well so setting it to EXIT_ZOMBIE
before do_notify_parent() and switching it to EXIT_DEAD when the parent
ignores SIGCHLD should be safe.
If we missed a sublety then we'll know pretty soon I'm sure.
I'll pick this up now. We'll have some time anyway.
>
> > places we call do_notify_parent(), so instead of calling
> > do_notify_pidfd() one time from do_notify_parent() we will be calling
> > it 4 times now.
> >
> > Also my original patch had memory barriers to ensure correct ordering
> > of tsk->exit_state writes before reads. In this final version Joel
> > removed them, so I suppose he found out they are not needed. Joel,
> > please clarify.
>
> The barriers were initially add by me to your patch, but then I felt it may
> not be needed so I removed them before sending the patch. My initial concern
> was something like the following:
>
> CPU 0 CPU 1
> ------------------------------------------------
> store tsk->exit_state = 1
> /* smp_wmb() ? */
> do_notify_parent
> wake up
> poll_wait()
> /* smp_rmb(); ? */
> read tsk->exit_state = 0
> block...
>
>
> I was initially concerned if tsk->exit_state write would be missed by the
> polling thread and we would block forever (similar to this bug).
>
> I don't think this is possible anymore since wakeup implies release-barrier
wake_up_all() which is used in do_notify_pidfd() implies a general
memory barrier if something is actually woken up.
> and waiting implies acquire barrier AFAIU. I am still not fully sure though,
poll_wait() when used with eventpoll hits add_wait_queue which takes
spin_lock_irqsave() which implies an acquire barrier if I remember
memory_barriers right.
> so yeah if you guys think it is an issue, let us add the memory barriers. As
> such I know memory barrier additions to the kernel requires justification,
> otherwise Linus calls it "Voodoo programming". So let us convince ourself
> first if memory barriers are needed before adding them anyway.
I didn't see it as an issue either.
>
> thanks,
>
> - Joel
>
>
>
>
> > Thanks!
> >
> > > That's a rather subtle internal change. I was worried about
> > > __ptrace_detach() since it explicitly checks for EXIT_ZOMBIE but it
> > > seems to me that this is fine since we hold write_lock_irq(&tasklist_lock);
> > > at the point when we do set p->exit_signal.
> > >
> > > Acked-by: Christian Brauner <[email protected]>
> > >
> > > Once Oleg confirms that I'm right not to worty I'll pick this up.
> > >
> > > Thanks!
> > > Christian
> > >
> > > >
> > > > ---
> > > > kernel/exit.c | 8 +++++---
> > > > 1 file changed, 5 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/kernel/exit.c b/kernel/exit.c
> > > > index a75b6a7f458a..740ceacb4b76 100644
> > > > --- a/kernel/exit.c
> > > > +++ b/kernel/exit.c
> > > > @@ -720,6 +720,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > > > if (group_dead)
> > > > kill_orphaned_pgrp(tsk->group_leader, NULL);
> > > >
> > > > + tsk->exit_state = EXIT_ZOMBIE;
> > > > if (unlikely(tsk->ptrace)) {
> > > > int sig = thread_group_leader(tsk) &&
> > > > thread_group_empty(tsk) &&
> > > > @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
> > > > ptrace_unlink(p);
> > > >
> > > > /* If parent wants a zombie, don't release it now */
> > > > - state = EXIT_ZOMBIE;
> > > > + p->exit_state = EXIT_ZOMBIE;
> > > > if (do_notify_parent(p, p->exit_signal))
> > > > - state = EXIT_DEAD;
> > > > - p->exit_state = state;
> > > > + p->exit_state = EXIT_DEAD;
> > > > +
> > > > + state = p->exit_state;
> > > > write_unlock_irq(&tasklist_lock);
> > > > }
> > > > if (state == EXIT_DEAD)
> > > > --
> > > > 2.22.0.657.g960e92d24f-goog
> > > >
On Wed, Jul 17, 2019 at 01:21:00PM -0400, Joel Fernandes wrote:
> From: Suren Baghdasaryan <[email protected]>
>
> There is a race between reading task->exit_state in pidfd_poll and writing
> it after do_notify_parent calls do_notify_pidfd. Expected sequence of
> events is:
>
> CPU 0 CPU 1
> ------------------------------------------------
> exit_notify
> do_notify_parent
> do_notify_pidfd
> tsk->exit_state = EXIT_DEAD
> pidfd_poll
> if (tsk->exit_state)
>
> However nothing prevents the following sequence:
>
> CPU 0 CPU 1
> ------------------------------------------------
> exit_notify
> do_notify_parent
> do_notify_pidfd
> pidfd_poll
> if (tsk->exit_state)
> tsk->exit_state = EXIT_DEAD
>
> This causes a polling task to wait forever, since poll blocks because
> exit_state is 0 and the waiting task is not notified again. A stress
> test continuously doing pidfd poll and process exits uncovered this bug,
Btw, if that stress test is in any way upstreamable I'd like to put this
into for-next as well. :)
Christian
On 07/17, Joel Fernandes (Google) wrote:
>
> exit_notify
> do_notify_parent
> do_notify_pidfd
> tsk->exit_state = EXIT_DEAD
OOPS. Somehow I thought it sets exit_state before do_notify_parent(),
I didn't even bother to verify this when I reviewed pidfd_poll()...
Sorry, can't read the patch today, will do tomorrow.
Oleg.
On Thu, Jul 18, 2019 at 3:17 AM Christian Brauner <[email protected]> wrote:
>
> On Wed, Jul 17, 2019 at 01:21:00PM -0400, Joel Fernandes wrote:
> > From: Suren Baghdasaryan <[email protected]>
> >
> > There is a race between reading task->exit_state in pidfd_poll and writing
> > it after do_notify_parent calls do_notify_pidfd. Expected sequence of
> > events is:
> >
> > CPU 0 CPU 1
> > ------------------------------------------------
> > exit_notify
> > do_notify_parent
> > do_notify_pidfd
> > tsk->exit_state = EXIT_DEAD
> > pidfd_poll
> > if (tsk->exit_state)
> >
> > However nothing prevents the following sequence:
> >
> > CPU 0 CPU 1
> > ------------------------------------------------
> > exit_notify
> > do_notify_parent
> > do_notify_pidfd
> > pidfd_poll
> > if (tsk->exit_state)
> > tsk->exit_state = EXIT_DEAD
> >
> > This causes a polling task to wait forever, since poll blocks because
> > exit_state is 0 and the waiting task is not notified again. A stress
> > test continuously doing pidfd poll and process exits uncovered this bug,
>
> Btw, if that stress test is in any way upstreamable I'd like to put this
> into for-next as well. :)
Definitely. I'll work with Joel on making it upstreamable and posting
as a separate patch.
>
> Christian
On Fri, Jul 19, 2019 at 06:14:05PM +0200, Oleg Nesterov wrote:
> it seems that I missed something else...
>
> On 07/17, Joel Fernandes (Google) wrote:
> >
> > @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
> > ptrace_unlink(p);
> >
> > /* If parent wants a zombie, don't release it now */
> > - state = EXIT_ZOMBIE;
> > + p->exit_state = EXIT_ZOMBIE;
> > if (do_notify_parent(p, p->exit_signal))
> > - state = EXIT_DEAD;
> > - p->exit_state = state;
> > + p->exit_state = EXIT_DEAD;
> > +
> > + state = p->exit_state;
> > write_unlock_irq(&tasklist_lock);
>
> why do you think we also need to change wait_task_zombie() ?
>
> pidfd_poll() only needs the exit_state != 0 check, we know that it
> is not zero at this point. Why do we need to change exit_state before
> do_notify_parent() ?
Oh, because of?:
/*
* Move the task's state to DEAD/TRACE, only one thread can do this.
*/
state = (ptrace_reparented(p) && thread_group_leader(p)) ?
EXIT_TRACE : EXIT_DEAD;
if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE)
return 0;
So exit_state will definitely be set in this scenario. Good point.
Christian
On July 19, 2019 6:51:20 PM GMT+02:00, Joel Fernandes <[email protected]> wrote:
>On Fri, Jul 19, 2019 at 12:27 PM Christian Brauner
><[email protected]> wrote:
>>
>> On Fri, Jul 19, 2019 at 06:14:05PM +0200, Oleg Nesterov wrote:
>> > it seems that I missed something else...
>> >
>> > On 07/17, Joel Fernandes (Google) wrote:
>> > >
>> > > @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct
>wait_opts *wo, struct task_struct *p)
>> > > ptrace_unlink(p);
>> > >
>> > > /* If parent wants a zombie, don't release it now */
>> > > - state = EXIT_ZOMBIE;
>> > > + p->exit_state = EXIT_ZOMBIE;
>> > > if (do_notify_parent(p, p->exit_signal))
>> > > - state = EXIT_DEAD;
>> > > - p->exit_state = state;
>> > > + p->exit_state = EXIT_DEAD;
>> > > +
>> > > + state = p->exit_state;
>> > > write_unlock_irq(&tasklist_lock);
>> >
>> > why do you think we also need to change wait_task_zombie() ?
>> >
>> > pidfd_poll() only needs the exit_state != 0 check, we know that it
>> > is not zero at this point. Why do we need to change exit_state
>before
>> > do_notify_parent() ?
>>
>> Oh, because of?:
>>
>> /*
>> * Move the task's state to DEAD/TRACE, only one thread can
>do this.
>> */
>> state = (ptrace_reparented(p) && thread_group_leader(p)) ?
>> EXIT_TRACE : EXIT_DEAD;
>> if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) !=
>EXIT_ZOMBIE)
>> return 0;
>>
>> So exit_state will definitely be set in this scenario. Good point.
>>
>
>Agreed. Christian, do you mind dropping this hunk from the patch or do
>you want me to resend the patch with the hunk dropped?
Yeah, no problem. :)
On Fri, Jul 19, 2019 at 12:27 PM Christian Brauner <[email protected]> wrote:
>
> On Fri, Jul 19, 2019 at 06:14:05PM +0200, Oleg Nesterov wrote:
> > it seems that I missed something else...
> >
> > On 07/17, Joel Fernandes (Google) wrote:
> > >
> > > @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
> > > ptrace_unlink(p);
> > >
> > > /* If parent wants a zombie, don't release it now */
> > > - state = EXIT_ZOMBIE;
> > > + p->exit_state = EXIT_ZOMBIE;
> > > if (do_notify_parent(p, p->exit_signal))
> > > - state = EXIT_DEAD;
> > > - p->exit_state = state;
> > > + p->exit_state = EXIT_DEAD;
> > > +
> > > + state = p->exit_state;
> > > write_unlock_irq(&tasklist_lock);
> >
> > why do you think we also need to change wait_task_zombie() ?
> >
> > pidfd_poll() only needs the exit_state != 0 check, we know that it
> > is not zero at this point. Why do we need to change exit_state before
> > do_notify_parent() ?
>
> Oh, because of?:
>
> /*
> * Move the task's state to DEAD/TRACE, only one thread can do this.
> */
> state = (ptrace_reparented(p) && thread_group_leader(p)) ?
> EXIT_TRACE : EXIT_DEAD;
> if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE)
> return 0;
>
> So exit_state will definitely be set in this scenario. Good point.
>
Agreed. Christian, do you mind dropping this hunk from the patch or do
you want me to resend the patch with the hunk dropped?
it seems that I missed something else...
On 07/17, Joel Fernandes (Google) wrote:
>
> @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
> ptrace_unlink(p);
>
> /* If parent wants a zombie, don't release it now */
> - state = EXIT_ZOMBIE;
> + p->exit_state = EXIT_ZOMBIE;
> if (do_notify_parent(p, p->exit_signal))
> - state = EXIT_DEAD;
> - p->exit_state = state;
> + p->exit_state = EXIT_DEAD;
> +
> + state = p->exit_state;
> write_unlock_irq(&tasklist_lock);
why do you think we also need to change wait_task_zombie() ?
pidfd_poll() only needs the exit_state != 0 check, we know that it
is not zero at this point. Why do we need to change exit_state before
do_notify_parent() ?
Oleg.
On Fri, Jul 19, 2019 at 9:27 AM Christian Brauner <[email protected]> wrote:
>
> On Fri, Jul 19, 2019 at 06:14:05PM +0200, Oleg Nesterov wrote:
> > it seems that I missed something else...
> >
> > On 07/17, Joel Fernandes (Google) wrote:
> > >
> > > @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
> > > ptrace_unlink(p);
> > >
> > > /* If parent wants a zombie, don't release it now */
> > > - state = EXIT_ZOMBIE;
> > > + p->exit_state = EXIT_ZOMBIE;
> > > if (do_notify_parent(p, p->exit_signal))
> > > - state = EXIT_DEAD;
> > > - p->exit_state = state;
> > > + p->exit_state = EXIT_DEAD;
> > > +
> > > + state = p->exit_state;
> > > write_unlock_irq(&tasklist_lock);
> >
> > why do you think we also need to change wait_task_zombie() ?
> >
> > pidfd_poll() only needs the exit_state != 0 check, we know that it
> > is not zero at this point. Why do we need to change exit_state before
> > do_notify_parent() ?
>
> Oh, because of?:
>
> /*
> * Move the task's state to DEAD/TRACE, only one thread can do this.
> */
> state = (ptrace_reparented(p) && thread_group_leader(p)) ?
> EXIT_TRACE : EXIT_DEAD;
> if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE)
> return 0;
>
> So exit_state will definitely be set in this scenario. Good point.
>
Yes, I think you are right. AFAIU in this code path p->exit_state
should always be equal to EXIT_TRACE because of the earlier cmpxchg()
call and the if condition before do_notify_parent(). That's of course
unless there is a chance that p->exit_state gets changed by some other
thread after cmpxchg() call and before do_notify_parent()... I'm not
that familiar with this code to say for sure that it's impossible. If
that can't happen I think we can remove this one but the change in
exit_notify() should definitely stay.
Thanks,
Suren.
> Christian
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>