check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by
"softlockup: check all tasks in hung_task" commit ce9dbe24 looks
absolutely wrong.
- rcu_lock_break() does put_task_struct(). If the task has exited
it is not safe to even read its ->state, nothing protects this
task_struct.
- The TASK_DEAD checks are wrong too. Contrary to the comment, we
can't use it to check if the task was unhashed. It can be unhashed
without TASK_DEAD, or it can be valid with TASK_DEAD.
For example, an autoreaping task can do release_task(current)
long before it sets TASK_DEAD in do_exit().
Or, a zombie task can have ->state == TASK_DEAD but release_task()
was not called, and in this case we must not break the loop.
Change this code to check pid_alive() instead, and do this before we
drop the reference to the task_struct.
Signed-off-by: Oleg Nesterov <[email protected]>
---
kernel/hung_task.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
--- 35-rc2/kernel/hung_task.c~CHT_FIX_RCU_LOCK_BREAK 2009-12-18 19:05:38.000000000 +0100
+++ 35-rc2/kernel/hung_task.c 2010-06-18 20:06:11.000000000 +0200
@@ -113,15 +113,20 @@ static void check_hung_task(struct task_
* For preemptible RCU it is sufficient to call rcu_read_unlock in order
* exit the grace period. For classic RCU, a reschedule is required.
*/
-static void rcu_lock_break(struct task_struct *g, struct task_struct *t)
+static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
{
+ bool can_cont;
+
get_task_struct(g);
get_task_struct(t);
rcu_read_unlock();
cond_resched();
rcu_read_lock();
+ can_cont = pid_alive(g) && pid_alive(t);
put_task_struct(t);
put_task_struct(g);
+
+ return can_cont;
}
/*
@@ -148,9 +153,7 @@ static void check_hung_uninterruptible_t
goto unlock;
if (!--batch_count) {
batch_count = HUNG_TASK_BATCHING;
- rcu_lock_break(g, t);
- /* Exit if t or g was unhashed during refresh. */
- if (t->state == TASK_DEAD || g->state == TASK_DEAD)
+ if (!rcu_lock_break(g, t))
goto unlock;
}
/* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
(add cc's)
Hmm. Once I sent this patch, I suddenly realized with horror that
while_each_thread() is NOT safe under rcu_read_lock(). Both
do_each_thread/while_each_thread or do/while_each_thread() can
race with exec().
Yes, it is safe to do next_thread() or next_task(). But:
#define while_each_thread(g, t) \
while ((t = next_thread(t)) != g)
suppose that t is not the group leader, and it does de_thread() and then
release_task(g). After that next_thread(t) returns t, not g, and the loop
will never stop.
I _really_ hope I missed something, will recheck tomorrow with the fresh
head. Still I'd like to share my concerns...
If I am right, probably we can fix this, something like
#define while_each_thread(g, t) \
while ((t = next_thread(t)) != g && pid_alive(g))
[we can't do while (!thread_group_leadr(t = next_thread(t)))].
but this needs barrires, and we should validate the callers anyway.
Or, perhaps,
#define XXX(t) ({
struct task_struct *__prev = t;
t = next_thread(t);
t != g && t != __prev;
})
#define while_each_thread(g, t) \
while (XXX(t))
Please tell me I am wrong!
Oleg.
On 06/18, Oleg Nesterov wrote:
>
> check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by
> "softlockup: check all tasks in hung_task" commit ce9dbe24 looks
> absolutely wrong.
>
> - rcu_lock_break() does put_task_struct(). If the task has exited
> it is not safe to even read its ->state, nothing protects this
> task_struct.
>
> - The TASK_DEAD checks are wrong too. Contrary to the comment, we
> can't use it to check if the task was unhashed. It can be unhashed
> without TASK_DEAD, or it can be valid with TASK_DEAD.
>
> For example, an autoreaping task can do release_task(current)
> long before it sets TASK_DEAD in do_exit().
>
> Or, a zombie task can have ->state == TASK_DEAD but release_task()
> was not called, and in this case we must not break the loop.
>
> Change this code to check pid_alive() instead, and do this before we
> drop the reference to the task_struct.
>
> Signed-off-by: Oleg Nesterov <[email protected]>
> ---
>
> kernel/hung_task.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> --- 35-rc2/kernel/hung_task.c~CHT_FIX_RCU_LOCK_BREAK 2009-12-18 19:05:38.000000000 +0100
> +++ 35-rc2/kernel/hung_task.c 2010-06-18 20:06:11.000000000 +0200
> @@ -113,15 +113,20 @@ static void check_hung_task(struct task_
> * For preemptible RCU it is sufficient to call rcu_read_unlock in order
> * exit the grace period. For classic RCU, a reschedule is required.
> */
> -static void rcu_lock_break(struct task_struct *g, struct task_struct *t)
> +static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
> {
> + bool can_cont;
> +
> get_task_struct(g);
> get_task_struct(t);
> rcu_read_unlock();
> cond_resched();
> rcu_read_lock();
> + can_cont = pid_alive(g) && pid_alive(t);
> put_task_struct(t);
> put_task_struct(g);
> +
> + return can_cont;
> }
>
> /*
> @@ -148,9 +153,7 @@ static void check_hung_uninterruptible_t
> goto unlock;
> if (!--batch_count) {
> batch_count = HUNG_TASK_BATCHING;
> - rcu_lock_break(g, t);
> - /* Exit if t or g was unhashed during refresh. */
> - if (t->state == TASK_DEAD || g->state == TASK_DEAD)
> + if (!rcu_lock_break(g, t))
> goto unlock;
> }
> /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
On Fri, Jun 18, 2010 at 09:02:51PM +0200, Oleg Nesterov wrote:
> check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by
> "softlockup: check all tasks in hung_task" commit ce9dbe24 looks
> absolutely wrong.
>
> - rcu_lock_break() does put_task_struct(). If the task has exited
> it is not safe to even read its ->state, nothing protects this
> task_struct.
>
> - The TASK_DEAD checks are wrong too. Contrary to the comment, we
> can't use it to check if the task was unhashed. It can be unhashed
> without TASK_DEAD, or it can be valid with TASK_DEAD.
>
> For example, an autoreaping task can do release_task(current)
> long before it sets TASK_DEAD in do_exit().
>
> Or, a zombie task can have ->state == TASK_DEAD but release_task()
> was not called, and in this case we must not break the loop.
>
> Change this code to check pid_alive() instead, and do this before we
> drop the reference to the task_struct.
>
> Signed-off-by: Oleg Nesterov <[email protected]>
Very nice!
Acked-by: Frederic Weisbecker <[email protected]>
Oleg Nesterov ([email protected]) wrote:
> check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by
> "softlockup: check all tasks in hung_task" commit ce9dbe24 looks
> absolutely wrong.
>
> - rcu_lock_break() does put_task_struct(). If the task has exited
> it is not safe to even read its ->state, nothing protects this
> task_struct.
>
> - The TASK_DEAD checks are wrong too. Contrary to the comment, we
> can't use it to check if the task was unhashed. It can be unhashed
> without TASK_DEAD, or it can be valid with TASK_DEAD.
>
> For example, an autoreaping task can do release_task(current)
> long before it sets TASK_DEAD in do_exit().
>
> Or, a zombie task can have ->state == TASK_DEAD but release_task()
> was not called, and in this case we must not break the loop.
>
> Change this code to check pid_alive() instead, and do this before we
> drop the reference to the task_struct.
>
> Signed-off-by: Oleg Nesterov <[email protected]>
> ---
Acked-by: Mandeep Singh Baines <[email protected]>
I think you're right. I can't see what would prevent that race.
So for_each_process and do_each_thread are safe only under
read_lock(&tasklist_lock) and while_each_thread is only safe under
either that or siglock. (Also a few places using next_thread in
similar loops outside those macros.)
Perhaps we could move those del's from __unhash_process to
__put_task_struct (or just delayed_put_task_struct?) and then
they wouldn't need to be rculist.h calls after all. But we
would a need careful audit to figure out the assumptions about
being on the list meaning not reaped yet.
I think de_thread() in exec-by-nonleader is the only case where this
can happen, right? So then perhaps we could make it call release_task
only via call_rcu?
Thanks,
Roland
On Fri, Jun 18, 2010 at 09:34:03PM +0200, Oleg Nesterov wrote:
> (add cc's)
>
> Hmm. Once I sent this patch, I suddenly realized with horror that
> while_each_thread() is NOT safe under rcu_read_lock(). Both
> do_each_thread/while_each_thread or do/while_each_thread() can
> race with exec().
>
> Yes, it is safe to do next_thread() or next_task(). But:
>
> #define while_each_thread(g, t) \
> while ((t = next_thread(t)) != g)
>
> suppose that t is not the group leader, and it does de_thread() and then
> release_task(g). After that next_thread(t) returns t, not g, and the loop
> will never stop.
>
> I _really_ hope I missed something, will recheck tomorrow with the fresh
> head. Still I'd like to share my concerns...
>
> If I am right, probably we can fix this, something like
>
> #define while_each_thread(g, t) \
> while ((t = next_thread(t)) != g && pid_alive(g))
>
> [we can't do while (!thread_group_leadr(t = next_thread(t)))].
> but this needs barrires, and we should validate the callers anyway.
>
> Or, perhaps,
>
> #define XXX(t) ({
> struct task_struct *__prev = t;
> t = next_thread(t);
> t != g && t != __prev;
> })
>
> #define while_each_thread(g, t) \
> while (XXX(t))
Isn't the above vulnerable to a pthread_create() immediately following
the offending exec()? Especially if the task doing the traversal is
preempted?
I cannot claim to understand the task-list code, but here are some
techniques that might (or might not) help:
o Check ACCESS_ONCE(p->group_leader == g), if false, restart
the traversal. Any race on update of p->group_leader would
sort itself out on later iterations. This of course might
require careful attention of the order of updates to ->group_leader
and the list pointers. I also don't like it much from a
worst-case response-time viewpoint. ;-)
Plus it requires all operations on the tasks be idempotent,
which is a bit ugly and restrictive.
o Maintain an ->oldnext field that tracks the old pointer to
the next task for one RCU grace period after a de_thread()
operation. When the grace period expires (presumably via
call_rcu()), the ->oldnext field is set to NULL.
If the ->oldnext field is non-NULL, any subsequent de_thread()
operations wait until it is NULL. (I convinced myself that
pthread_create() need -not- wait, but perhaps mistakenly --
the idea is that any recent de_thread() victim remains group
leader, so is skipped by while_each_thread(), but you would
know better than I.)
Then while_each_thread() does checks similar to what you have
above, possibly with the addition of the ->group_leader check,
but follows the ->oldnext pointer if the checks indicate that
this task has de_thread()ed itself. The ->oldnext access will
need to be preceded by a memory barrier, but this is off the
fast path, so should be OK. There would also need to be
memory barriers on the update side.
o Do the de_thread() incrementally. So if the list is tasks A,
B, and C, in that order, and if we are de-thread()ing B,
then make A's pointer refer to C, wait a grace period, then
complete the de_thread() operation. I would be surprised if
people would actually be happy with the resulting increase in
exec() overhead, but mentioning it for completeness. Of course,
synchronize_rcu_expedited() has much shorter latency, and might
work in this situation. (Though please do let me know if you
choose this approach -- it will mean that I need to worry about
synchronize_rcu_expedited() scalability sooner rather than
later! Which is OK as long as I know.)
This all assumes that is OK for de_thread() to block, but I have
no idea if this is the case.
> Please tell me I am wrong!
It would take a braver man than me to say that Oleg Nesterov is wrong!
Thanx, Paul
> Oleg.
>
> On 06/18, Oleg Nesterov wrote:
> >
> > check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by
> > "softlockup: check all tasks in hung_task" commit ce9dbe24 looks
> > absolutely wrong.
> >
> > - rcu_lock_break() does put_task_struct(). If the task has exited
> > it is not safe to even read its ->state, nothing protects this
> > task_struct.
> >
> > - The TASK_DEAD checks are wrong too. Contrary to the comment, we
> > can't use it to check if the task was unhashed. It can be unhashed
> > without TASK_DEAD, or it can be valid with TASK_DEAD.
> >
> > For example, an autoreaping task can do release_task(current)
> > long before it sets TASK_DEAD in do_exit().
> >
> > Or, a zombie task can have ->state == TASK_DEAD but release_task()
> > was not called, and in this case we must not break the loop.
> >
> > Change this code to check pid_alive() instead, and do this before we
> > drop the reference to the task_struct.
> >
> > Signed-off-by: Oleg Nesterov <[email protected]>
> > ---
> >
> > kernel/hung_task.c | 11 +++++++----
> > 1 file changed, 7 insertions(+), 4 deletions(-)
> >
> > --- 35-rc2/kernel/hung_task.c~CHT_FIX_RCU_LOCK_BREAK 2009-12-18 19:05:38.000000000 +0100
> > +++ 35-rc2/kernel/hung_task.c 2010-06-18 20:06:11.000000000 +0200
> > @@ -113,15 +113,20 @@ static void check_hung_task(struct task_
> > * For preemptible RCU it is sufficient to call rcu_read_unlock in order
> > * exit the grace period. For classic RCU, a reschedule is required.
> > */
> > -static void rcu_lock_break(struct task_struct *g, struct task_struct *t)
> > +static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
> > {
> > + bool can_cont;
> > +
> > get_task_struct(g);
> > get_task_struct(t);
> > rcu_read_unlock();
> > cond_resched();
> > rcu_read_lock();
> > + can_cont = pid_alive(g) && pid_alive(t);
> > put_task_struct(t);
> > put_task_struct(g);
> > +
> > + return can_cont;
> > }
> >
> > /*
> > @@ -148,9 +153,7 @@ static void check_hung_uninterruptible_t
> > goto unlock;
> > if (!--batch_count) {
> > batch_count = HUNG_TASK_BATCHING;
> > - rcu_lock_break(g, t);
> > - /* Exit if t or g was unhashed during refresh. */
> > - if (t->state == TASK_DEAD || g->state == TASK_DEAD)
> > + if (!rcu_lock_break(g, t))
> > goto unlock;
> > }
> > /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
>
On 06/18, Roland McGrath wrote:
>
> I think you're right. I can't see what would prevent that race.
How sad.
> So for_each_process
for_each_process() looks fine. It uses init_task as the anchor,
it can't go away, it is swapper.
> and do_each_thread are safe only under
> read_lock(&tasklist_lock) and while_each_thread is only safe under
> either that or siglock.
Yes,
(Also a few places using next_thread in
> similar loops outside those macros.)
I hope that most (all?) of next_thread() users can be converted to
use while_each_thread().
> Perhaps we could move those del's from __unhash_process to
> __put_task_struct (or just delayed_put_task_struct?)
This needs write_lock_irq(tasklist), we can't take it in atomic
context. And I bet this change (at least right now) has other
implications.
> I think de_thread() in exec-by-nonleader is the only case where this
> can happen, right? So then perhaps we could make it call release_task
> only via call_rcu?
Hmm, perhaps... I am already sleeping, will try to check this idea
tomorrow. At first glance, it looks promising to me. And I see the
email from Paul which is too late to read for me today ;)
In any case, I _think_ we can fix while_each_thread(), say XXX(t)
from the previous email. But then we should audit the users like
zap_threads() which assume we should not miss any "interesting" task.
Probably zap_threads() is fine because of mmap_sem, but I can't
think properly now.
Oleg.
On Fri, Jun 18, 2010 at 12:34 PM, Oleg Nesterov <[email protected]> wrote:
> (add cc's)
>
> Hmm. Once I sent this patch, I suddenly realized with horror that
> while_each_thread() is NOT safe under rcu_read_lock(). Both
> do_each_thread/while_each_thread or do/while_each_thread() can
> race with exec().
>
> Yes, it is safe to do next_thread() or next_task(). But:
>
> ? ? ? ?#define while_each_thread(g, t) \
> ? ? ? ? ? ? ? ?while ((t = next_thread(t)) != g)
>
> suppose that t is not the group leader, and it does de_thread() and then
> release_task(g). After that next_thread(t) returns t, not g, and the loop
> will never stop.
>
> I _really_ hope I missed something, will recheck tomorrow with the fresh
> head. Still I'd like to share my concerns...
>
Yep. You're right. Not sure what I was thinking. This is only case
where do_each_thread
is protected by an rcu_read_lock. All others, correctly use read_lock.
> If I am right, probably we can fix this, something like
>
> ? ? ? ?#define while_each_thread(g, t) \
> ? ? ? ? ? ? ? ?while ((t = next_thread(t)) != g && pid_alive(g))
>
This seems like a reasonable approach. Maybe call it:
while_each_thread_maybe_rcu() :)
This makes hung_task a little less useful for failure fencing since
this (and rcu_lock_break)
increases the potential for never examining all threads but its still
a nice lightweight diagnostic
for finding bugs.
> [we can't do while (!thread_group_leadr(t = next_thread(t)))].
> but this needs barrires, and we should validate the callers anyway.
>
> Or, perhaps,
>
> ? ? ? ?#define XXX(t) ?({
> ? ? ? ? ? ? ? ?struct task_struct *__prev = t;
> ? ? ? ? ? ? ? ?t = next_thread(t);
> ? ? ? ? ? ? ? ?t != g && t != __prev;
> ? ? ? ?})
>
> ? ? ? ?#define while_each_thread(g, t) \
> ? ? ? ? ? ? ? ?while (XXX(t))
>
> Please tell me I am wrong!
>
> Oleg.
>
> On 06/18, Oleg Nesterov wrote:
>>
>> check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by
>> "softlockup: check all tasks in hung_task" commit ce9dbe24 looks
>> absolutely wrong.
>>
>> ? ? ? - rcu_lock_break() does put_task_struct(). If the task has exited
>> ? ? ? ? it is not safe to even read its ->state, nothing protects this
>> ? ? ? ? task_struct.
>>
>> ? ? ? - The TASK_DEAD checks are wrong too. Contrary to the comment, we
>> ? ? ? ? can't use it to check if the task was unhashed. It can be unhashed
>> ? ? ? ? without TASK_DEAD, or it can be valid with TASK_DEAD.
>>
>> ? ? ? ? For example, an autoreaping task can do release_task(current)
>> ? ? ? ? long before it sets TASK_DEAD in do_exit().
>>
>> ? ? ? ? Or, a zombie task can have ->state == TASK_DEAD but release_task()
>> ? ? ? ? was not called, and in this case we must not break the loop.
>>
>> Change this code to check pid_alive() instead, and do this before we
>> drop the reference to the task_struct.
>>
>> Signed-off-by: Oleg Nesterov <[email protected]>
>> ---
>>
>> ?kernel/hung_task.c | ? 11 +++++++----
>> ?1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> --- 35-rc2/kernel/hung_task.c~CHT_FIX_RCU_LOCK_BREAK ?2009-12-18 19:05:38.000000000 +0100
>> +++ 35-rc2/kernel/hung_task.c 2010-06-18 20:06:11.000000000 +0200
>> @@ -113,15 +113,20 @@ static void check_hung_task(struct task_
>> ? * For preemptible RCU it is sufficient to call rcu_read_unlock in order
>> ? * exit the grace period. For classic RCU, a reschedule is required.
>> ? */
>> -static void rcu_lock_break(struct task_struct *g, struct task_struct *t)
>> +static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
>> ?{
>> + ? ? bool can_cont;
>> +
>> ? ? ? get_task_struct(g);
>> ? ? ? get_task_struct(t);
>> ? ? ? rcu_read_unlock();
>> ? ? ? cond_resched();
>> ? ? ? rcu_read_lock();
>> + ? ? can_cont = pid_alive(g) && pid_alive(t);
>> ? ? ? put_task_struct(t);
>> ? ? ? put_task_struct(g);
>> +
>> + ? ? return can_cont;
>> ?}
>>
>> ?/*
>> @@ -148,9 +153,7 @@ static void check_hung_uninterruptible_t
>> ? ? ? ? ? ? ? ? ? ? ? goto unlock;
>> ? ? ? ? ? ? ? if (!--batch_count) {
>> ? ? ? ? ? ? ? ? ? ? ? batch_count = HUNG_TASK_BATCHING;
>> - ? ? ? ? ? ? ? ? ? ? rcu_lock_break(g, t);
>> - ? ? ? ? ? ? ? ? ? ? /* Exit if t or g was unhashed during refresh. */
>> - ? ? ? ? ? ? ? ? ? ? if (t->state == TASK_DEAD || g->state == TASK_DEAD)
>> + ? ? ? ? ? ? ? ? ? ? if (!rcu_lock_break(g, t))
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? goto unlock;
>> ? ? ? ? ? ? ? }
>> ? ? ? ? ? ? ? /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
>
>
On Fri, Jun 18, 2010 at 10:00:54PM -0700, Mandeep Baines wrote:
> On Fri, Jun 18, 2010 at 12:34 PM, Oleg Nesterov <[email protected]> wrote:
> > (add cc's)
> >
> > Hmm. Once I sent this patch, I suddenly realized with horror that
> > while_each_thread() is NOT safe under rcu_read_lock(). Both
> > do_each_thread/while_each_thread or do/while_each_thread() can
> > race with exec().
> >
> > Yes, it is safe to do next_thread() or next_task(). But:
> >
> > ? ? ? ?#define while_each_thread(g, t) \
> > ? ? ? ? ? ? ? ?while ((t = next_thread(t)) != g)
> >
> > suppose that t is not the group leader, and it does de_thread() and then
> > release_task(g). After that next_thread(t) returns t, not g, and the loop
> > will never stop.
> >
> > I _really_ hope I missed something, will recheck tomorrow with the fresh
> > head. Still I'd like to share my concerns...
> >
>
> Yep. You're right. Not sure what I was thinking. This is only case
> where do_each_thread
> is protected by an rcu_read_lock. All others, correctly use read_lock.
cgroup does too.
taskstats also uses rcu with while_each_threads, and may be some
others.
It's not your fault, theses iterators are supposed to be rcu safe,
we are just encountering a bad race :)
> > If I am right, probably we can fix this, something like
> >
> > ? ? ? ?#define while_each_thread(g, t) \
> > ? ? ? ? ? ? ? ?while ((t = next_thread(t)) != g && pid_alive(g))
> >
>
> This seems like a reasonable approach. Maybe call it:
>
> while_each_thread_maybe_rcu() :)
Hmm, no while_each_thread must really be rcu_safe.
>
> This makes hung_task a little less useful for failure fencing since
> this (and rcu_lock_break)
> increases the potential for never examining all threads but its still
> a nice lightweight diagnostic
> for finding bugs.
In fact may be we could just drop the rcu break, people who really
care about latencies can use the preemptable version.
I know the worry is more about delaying too much the grace period if
we walk a too long task list, but I don't think it's really a problem.
On Fri, Jun 18, 2010 at 10:35 PM, Frederic Weisbecker
<[email protected]> wrote:
> On Fri, Jun 18, 2010 at 10:00:54PM -0700, Mandeep Baines wrote:
>> On Fri, Jun 18, 2010 at 12:34 PM, Oleg Nesterov <[email protected]> wrote:
>> > (add cc's)
>> >
>> > Hmm. Once I sent this patch, I suddenly realized with horror that
>> > while_each_thread() is NOT safe under rcu_read_lock(). Both
>> > do_each_thread/while_each_thread or do/while_each_thread() can
>> > race with exec().
>> >
>> > Yes, it is safe to do next_thread() or next_task(). But:
>> >
>> > ? ? ? ?#define while_each_thread(g, t) \
>> > ? ? ? ? ? ? ? ?while ((t = next_thread(t)) != g)
>> >
>> > suppose that t is not the group leader, and it does de_thread() and then
>> > release_task(g). After that next_thread(t) returns t, not g, and the loop
>> > will never stop.
>> >
>> > I _really_ hope I missed something, will recheck tomorrow with the fresh
>> > head. Still I'd like to share my concerns...
>> >
>>
>> Yep. You're right. Not sure what I was thinking. This is only case
>> where do_each_thread
>> is protected by an rcu_read_lock. All others, correctly use read_lock.
>
>
>
> cgroup does too.
> taskstats also uses rcu with while_each_threads, and may be some
> others.
>
> It's not your fault, theses iterators are supposed to be rcu safe,
> we are just encountering a bad race :)
>
Thanks:) Feel less dumb now. My quick grep only turned up hung_task:
$ find . -name \*.c | xargs fgrep -B 10 do_each_thread | grep rcu
./kernel/hung_task.c- rcu_read_lock();
>
>
>> > If I am right, probably we can fix this, something like
>> >
>> > ? ? ? ?#define while_each_thread(g, t) \
>> > ? ? ? ? ? ? ? ?while ((t = next_thread(t)) != g && pid_alive(g))
>> >
>>
>> This seems like a reasonable approach. Maybe call it:
>>
>> while_each_thread_maybe_rcu() :)
>
>
>
> Hmm, no while_each_thread must really be rcu_safe.
>
I didn't realize there were other cases which need while_each_thread to
be rcu-safe. For hung_task, its OK to break out on a release_task(g).
We'll just check the threads we missed on the next iteration.
>
>
>>
>> This makes hung_task a little less useful for failure fencing since
>> this (and rcu_lock_break)
>> increases the potential for never examining all threads but its still
>> a nice lightweight diagnostic
>> for finding bugs.
>
>
>
> In fact may be we could just drop the rcu break, people who really
> care about latencies can use the preemptable version.
>
For large systems, you'd pin a CPU for a very long time checking for
hung_tasks. You'd cause a lot of memory pressure by delaying the
grace period for such a long time. You'd also cause softlockups with
the huge burst of call_rcus being processed by rcu_process_callbacks.
> I know the worry is more about delaying too much the grace period if
> we walk a too long task list, but I don't think it's really a problem.
>
>
On 06/18, Mandeep Baines wrote:
>
> On Fri, Jun 18, 2010 at 12:34 PM, Oleg Nesterov <[email protected]> wrote:
> >
> >
> > I _really_ hope I missed something, will recheck tomorrow with the fresh
> > head. Still I'd like to share my concerns...
>
> Yep. You're right. Not sure what I was thinking. This is only case
> where do_each_thread
> is protected by an rcu_read_lock. All others, correctly use read_lock.
No, no. while_each_thread() is supposed to be rcu-safe, we should fix it.
It has many rcu_read_lock() users.
The patch which fixes rcu_lock_break() is orthogonal to this problem.
Oleg.
On 06/18, Paul E. McKenney wrote:
>
> On Fri, Jun 18, 2010 at 09:34:03PM +0200, Oleg Nesterov wrote:
> >
> > #define XXX(t) ({
> > struct task_struct *__prev = t;
> > t = next_thread(t);
> > t != g && t != __prev;
> > })
> >
> > #define while_each_thread(g, t) \
> > while (XXX(t))
>
> Isn't the above vulnerable to a pthread_create() immediately following
> the offending exec()? Especially if the task doing the traversal is
> preempted?
Yes, thanks!
> here are some techniques that might (or might not) help:
To simplify, let's consider the concrete example,
rcu_read_lock();
g = t = returns_the_rcu_safe_task_struct_ptr();
do {
printk("%d\n", t->pid);
} while_each_thread(g, t);
rcu_read_unlock();
Whatever we do, without tasklist/siglock this can obviously race
with fork/exit/exec. It is OK to miss a thread, or print the pid
of the already exited/released task.
But it should not loop forever (the subject), and it should not
print the same pid twice (ignoring pid reuse, of course).
And, afaics, there are no problems with rcu magic per se, next_thread()
always returns the task_struct we can safely dereference. The only
problem is that while_each_thread() assumes that sooner or later
next_thread() must reach the starting point, g.
(zap_threads() is different, it must not miss a thread with ->mm
we are going to dump, but it holds mmap_sem).
> o Check ACCESS_ONCE(p->group_leader == g),
but this assumeas that while_each_thread(g, t) always
uses g == group_leader. zap_other_threads(), for example,
starts with g == current and not necessarily the leader.
> if false, restart
> the traversal.
I am not sure we should restart (don't pid the same pid twice),
probably we should break the loop.
So, I am thinking about the first attempt
#define while_each_thread(g, t) \
while ((t = next_thread(t)) != g && pid_alive(g))
again. But this means while_each_thread() can miss more threads
than it currently can under the same conditions. Correct, but
not good.
And,
> This of course might
> require careful attention of the order of updates to ->group_leader
> and the list pointers.
Yes. At least the code above needs barriers, and I am not sure
it can be really correct without more complications.
> o Maintain an ->oldnext field that tracks the old pointer to
> the next task for one RCU grace period after a de_thread()
> operation. When the grace period expires (presumably via
> call_rcu()), the ->oldnext field is set to NULL.
Well, another field in task_struct...
> o Do the de_thread() incrementally. So if the list is tasks A,
> B, and C, in that order, and if we are de-thread()ing B,
> then make A's pointer refer to C,
This breaks while_each_thread() under tasklist/siglock. It must
see all unhashed tasks.
Oh. I'll try to think more, but I also hope someone else can suggest
a simple solution.
> > Please tell me I am wrong!
>
> It would take a braver man than me to say that Oleg Nesterov is wrong!
Hmm. Given that you proved I was wrong in the first lines of your
email, I do not know what to think ;)
Oleg.
On 06/21, Oleg Nesterov wrote:
>
> So, I am thinking about the first attempt
>
> #define while_each_thread(g, t) \
> while ((t = next_thread(t)) != g && pid_alive(g))
>
> again. But this means while_each_thread() can miss more threads
> than it currently can under the same conditions. Correct, but
> not good.
Not good, but correct ;) Probably it makes sense to fix the problem
anyway, then think about the more optimal fix.
static inline struct task_struct *
next_thread_careful(const struct task_struct *g, const struct task_struct *t)
{
t = next_thread(t);
/*
* this pairs with the implicit barrier between detach_pid()
* and list_del_rcu(g->thread_group) in __unhash_process(g).
*/
smp_rmb();
if (likely(pid_alive(g)))
return t;
else
return g;
}
#define while_each_thread(g, t) \
while ((t = next_thread_careful(t)) != g)
I think this should work. detach_pid() does unlock + lock at least
once and thus we have the barrier (this worth a comment or we
can add the explicit wmb() in __unhash_process).
Paul, Roland, do you see any problems from the correctness pov,
or a better fix for now?
Perhaps it also makes sense to keep the old variant renamed to
while_each_thread_locked(), I dunno.
Oleg.
On 06/21, Oleg Nesterov wrote:
>
> On 06/21, Oleg Nesterov wrote:
> >
> > So, I am thinking about the first attempt
> >
> > #define while_each_thread(g, t) \
> > while ((t = next_thread(t)) != g && pid_alive(g))
> >
> > again. But this means while_each_thread() can miss more threads
> > than it currently can under the same conditions. Correct, but
> > not good.
>
> Not good, but correct ;) Probably it makes sense to fix the problem
> anyway, then think about the more optimal fix.
>
> static inline struct task_struct *
> next_thread_careful(const struct task_struct *g, const struct task_struct *t)
> {
> t = next_thread(t);
> /*
> * this pairs with the implicit barrier between detach_pid()
> * and list_del_rcu(g->thread_group) in __unhash_process(g).
> */
> smp_rmb();
> if (likely(pid_alive(g)))
> return t;
> else
> return g;
> }
>
> #define while_each_thread(g, t) \
> while ((t = next_thread_careful(t)) != g)
>
> I think this should work. detach_pid() does unlock + lock at least
> once and thus we have the barrier (this worth a comment or we
> can add the explicit wmb() in __unhash_process).
>
> Paul, Roland, do you see any problems from the correctness pov,
> or a better fix for now?
>
> Perhaps it also makes sense to keep the old variant renamed to
> while_each_thread_locked(), I dunno.
Well. but current_is_single_threaded() and zap_threads() have to
use next_thread() or while_each_thread_locked() in this case...
Oleg.
> Paul, Roland, do you see any problems from the correctness pov,
> or a better fix for now?
>
> Perhaps it also makes sense to keep the old variant renamed to
> while_each_thread_locked(), I dunno.
Did we verify that only de_thread() can create the situation where a
while_each_thread-style loop without either lock can be confused? If
that's so, then just changing it to avoid the situation seems like it
would be less invasive overall.
Thanks,
Roland
On 06/21, Roland McGrath wrote:
>
> > Paul, Roland, do you see any problems from the correctness pov,
> > or a better fix for now?
> >
> > Perhaps it also makes sense to keep the old variant renamed to
> > while_each_thread_locked(), I dunno.
>
> Did we verify that only de_thread() can create the situation where a
> while_each_thread-style loop without either lock can be confused?
I think yes, this is is the only case.
I mean, while_each_thread(group_leader, t). If g != group_leader, then
the lockless while_each_thread() has problems with the plain exit(g).
Afaics. The more I think about this, the more I feel confused ;)
But if we start from ->group_leader, then while_each_thread() must
stop eventually. Otherwise we should assume that the dead (unhashed)
tasks can create the circular list, obviously this is not possible.
> If
> that's so, then just changing it to avoid the situation seems like it
> would be less invasive overall.
How? We should change ->group_leader uner write_lock_irq(tasklist),
synchronize_rcu() is not an option. We can't do call_rcu(release_task),
we can't take tasklist for writing in the softirq context. But even
if we could, this can't help in fact or I missed something.
Oleg.
On Mon, Jun 21, 2010 at 07:09:19PM +0200, Oleg Nesterov wrote:
> On 06/18, Paul E. McKenney wrote:
> >
> > On Fri, Jun 18, 2010 at 09:34:03PM +0200, Oleg Nesterov wrote:
> > >
> > > #define XXX(t) ({
> > > struct task_struct *__prev = t;
> > > t = next_thread(t);
> > > t != g && t != __prev;
> > > })
> > >
> > > #define while_each_thread(g, t) \
> > > while (XXX(t))
> >
> > Isn't the above vulnerable to a pthread_create() immediately following
> > the offending exec()? Especially if the task doing the traversal is
> > preempted?
>
> Yes, thanks!
>
> > here are some techniques that might (or might not) help:
>
> To simplify, let's consider the concrete example,
Sounds very good!
> rcu_read_lock();
>
> g = t = returns_the_rcu_safe_task_struct_ptr();
This returns a pointer to the task struct of the current thread?
Or might this return a pointer some other thread's task struct?
> do {
> printk("%d\n", t->pid);
> } while_each_thread(g, t);
>
> rcu_read_unlock();
>
> Whatever we do, without tasklist/siglock this can obviously race
> with fork/exit/exec. It is OK to miss a thread, or print the pid
> of the already exited/released task.
>
> But it should not loop forever (the subject), and it should not
> print the same pid twice (ignoring pid reuse, of course).
>
> And, afaics, there are no problems with rcu magic per se, next_thread()
> always returns the task_struct we can safely dereference. The only
> problem is that while_each_thread() assumes that sooner or later
> next_thread() must reach the starting point, g.
>
> (zap_threads() is different, it must not miss a thread with ->mm
> we are going to dump, but it holds mmap_sem).
Indeed, the tough part is figuring out when you are done given that things
can come and go at will. Some additional tricks, in no particular order:
1. Always start at the group leader. Of course, the group leader
is probably permitted to leave any time it wants to, so this
is not sufficient in and of itself.
2. Maintain a separate task structure that flags the head of the
list. This separate structure is freed one RCU grace period
following the disappearance of the current group leader. This
should be quite robust, but "holy overhead, Batman!!!" (Apologies
for the American pop culture reference, but nothing else seemed
appropriate.)
3. Like #2, but reduce the storage overhead by using a structure
containing only the pointers and perhaps a few additional fields.
Set one of the bottom bits of one of the pointers to flag the fact
that this is a special structure, allowing the reader to take
note and do any additional checks required.
Of course, it is possible for a reader to be diverted onto a
new thread group, so it is then also possible that it needs
to return back to the previous thread group. The smaller structure
corresponding to the smaller structure might contain the
pointers required for it to make this transition.
Please note that a given thread group will have several of
these structures if several thread-group depart in quick
succession. The smaller structure would need to contain
enough information to allow the reader to thread back to
whatever group-leader task it started from. Which will likely
make managing the lifetimes of these smaller structures
"interesting"... Though this could be eased by maintaining
forward links as well as backwards links. Then when the
grace period expired, the corresponding backwards links could
be NULLed -- and a second grace period would likely be needed.
It is tempting to leave these special structs out of any
thread groups that have not yet had a leader depart, but this
is probably vulnerable to looping. (It might be possible to
avoid this, but additional checks are required.)
There are other approaches, but it is probably time for you to teach me
more about the requirements you face.
> > o Check ACCESS_ONCE(p->group_leader == g),
>
> but this assumeas that while_each_thread(g, t) always
> uses g == group_leader. zap_other_threads(), for example,
> starts with g == current and not necessarily the leader.
>
> > if false, restart
> > the traversal.
>
> I am not sure we should restart (don't pid the same pid twice),
> probably we should break the loop.
>
> So, I am thinking about the first attempt
>
> #define while_each_thread(g, t) \
> while ((t = next_thread(t)) != g && pid_alive(g))
>
> again. But this means while_each_thread() can miss more threads
> than it currently can under the same conditions. Correct, but
> not good.
OK, so we really want to complete the scan through the thread group, or
at least avoid introducing additional tendency to not complete the scan.
Good to know. ;-)
> And,
>
> > This of course might
> > require careful attention of the order of updates to ->group_leader
> > and the list pointers.
>
> Yes. At least the code above needs barriers, and I am not sure
> it can be really correct without more complications.
Agreed!
> > o Maintain an ->oldnext field that tracks the old pointer to
> > the next task for one RCU grace period after a de_thread()
> > operation. When the grace period expires (presumably via
> > call_rcu()), the ->oldnext field is set to NULL.
>
> Well, another field in task_struct...
Yeah, would be good to avoid this. Not sure it can be avoided, though.
> > o Do the de_thread() incrementally. So if the list is tasks A,
> > B, and C, in that order, and if we are de-thread()ing B,
> > then make A's pointer refer to C,
>
> This breaks while_each_thread() under tasklist/siglock. It must
> see all unhashed tasks.
Could de_thread() hold those locks in order to avoid that breakage?
> Oh. I'll try to think more, but I also hope someone else can suggest
> a simple solution.
;-)
> > > Please tell me I am wrong!
> >
> > It would take a braver man than me to say that Oleg Nesterov is wrong!
>
> Hmm. Given that you proved I was wrong in the first lines of your
> email, I do not know what to think ;)
Ah, but I didn't prove you wrong. Instead, I simply asked if you
were wrong. You supplied the proof. ;-)
Thanx, Paul
Oleg Nesterov <[email protected]> writes:
> On 06/21, Roland McGrath wrote:
>>
>> > Paul, Roland, do you see any problems from the correctness pov,
>> > or a better fix for now?
>> >
>> > Perhaps it also makes sense to keep the old variant renamed to
>> > while_each_thread_locked(), I dunno.
>>
>> Did we verify that only de_thread() can create the situation where a
>> while_each_thread-style loop without either lock can be confused?
>
> I think yes, this is is the only case.
>
> I mean, while_each_thread(group_leader, t). If g != group_leader, then
> the lockless while_each_thread() has problems with the plain exit(g).
>
> Afaics. The more I think about this, the more I feel confused ;)
>
> But if we start from ->group_leader, then while_each_thread() must
> stop eventually. Otherwise we should assume that the dead (unhashed)
> tasks can create the circular list, obviously this is not possible.
>
>> If
>> that's so, then just changing it to avoid the situation seems like it
>> would be less invasive overall.
>
> How? We should change ->group_leader uner write_lock_irq(tasklist),
> synchronize_rcu() is not an option. We can't do call_rcu(release_task),
> we can't take tasklist for writing in the softirq context. But even
> if we could, this can't help in fact or I missed something.
We already do: call_rcu(&p->rcu, delayed_put_task_struct); in release_task.
We don't call release_task until after we have removed it as leader and
dropped the write lock.
At first glance it sounds like the group leader is safe as a stopping
point for a rcu while_each_thread, and I expect the fact that
de_thread takes everything down to a single thread, could have nice
properties here. If pid_alive were only to fail on the group leader
when de_thread is called I think we could legitimately say that an event
we won't worry about. It is close enough to a new thread being created
anyway.
Eric
"Paul E. McKenney" <[email protected]> writes:
> On Mon, Jun 21, 2010 at 07:09:19PM +0200, Oleg Nesterov wrote:
>> On 06/18, Paul E. McKenney wrote:
>> >
>> > On Fri, Jun 18, 2010 at 09:34:03PM +0200, Oleg Nesterov wrote:
>> > >
>> > > #define XXX(t) ({
>> > > struct task_struct *__prev = t;
>> > > t = next_thread(t);
>> > > t != g && t != __prev;
>> > > })
>> > >
>> > > #define while_each_thread(g, t) \
>> > > while (XXX(t))
>> >
>> > Isn't the above vulnerable to a pthread_create() immediately following
>> > the offending exec()? Especially if the task doing the traversal is
>> > preempted?
>>
>> Yes, thanks!
>>
>> > here are some techniques that might (or might not) help:
>>
>> To simplify, let's consider the concrete example,
>
> Sounds very good!
>
>> rcu_read_lock();
>>
>> g = t = returns_the_rcu_safe_task_struct_ptr();
>
> This returns a pointer to the task struct of the current thread?
> Or might this return a pointer some other thread's task struct?
>
>> do {
>> printk("%d\n", t->pid);
>> } while_each_thread(g, t);
>>
>> rcu_read_unlock();
>>
>> Whatever we do, without tasklist/siglock this can obviously race
>> with fork/exit/exec. It is OK to miss a thread, or print the pid
>> of the already exited/released task.
>>
>> But it should not loop forever (the subject), and it should not
>> print the same pid twice (ignoring pid reuse, of course).
>>
>> And, afaics, there are no problems with rcu magic per se, next_thread()
>> always returns the task_struct we can safely dereference. The only
>> problem is that while_each_thread() assumes that sooner or later
>> next_thread() must reach the starting point, g.
>>
>> (zap_threads() is different, it must not miss a thread with ->mm
>> we are going to dump, but it holds mmap_sem).
>
> Indeed, the tough part is figuring out when you are done given that things
> can come and go at will. Some additional tricks, in no particular order:
>
> 1. Always start at the group leader. Of course, the group leader
> is probably permitted to leave any time it wants to, so this
> is not sufficient in and of itself.
No. The group leader must exist as long as the group exists.
Modulo de_thread weirdness. The group_leader can be a zombie but
it can not go away completely.
> 2. Maintain a separate task structure that flags the head of the
> list. This separate structure is freed one RCU grace period
> following the disappearance of the current group leader. This
> should be quite robust, but "holy overhead, Batman!!!" (Apologies
> for the American pop culture reference, but nothing else seemed
> appropriate.)
That is roughly what we have in the group leader right now.
Eric
On Mon, Jun 21, 2010 at 02:22:59PM -0700, Eric W. Biederman wrote:
> "Paul E. McKenney" <[email protected]> writes:
>
> > On Mon, Jun 21, 2010 at 07:09:19PM +0200, Oleg Nesterov wrote:
> >> On 06/18, Paul E. McKenney wrote:
> >> >
> >> > On Fri, Jun 18, 2010 at 09:34:03PM +0200, Oleg Nesterov wrote:
> >> > >
> >> > > #define XXX(t) ({
> >> > > struct task_struct *__prev = t;
> >> > > t = next_thread(t);
> >> > > t != g && t != __prev;
> >> > > })
> >> > >
> >> > > #define while_each_thread(g, t) \
> >> > > while (XXX(t))
> >> >
> >> > Isn't the above vulnerable to a pthread_create() immediately following
> >> > the offending exec()? Especially if the task doing the traversal is
> >> > preempted?
> >>
> >> Yes, thanks!
> >>
> >> > here are some techniques that might (or might not) help:
> >>
> >> To simplify, let's consider the concrete example,
> >
> > Sounds very good!
> >
> >> rcu_read_lock();
> >>
> >> g = t = returns_the_rcu_safe_task_struct_ptr();
> >
> > This returns a pointer to the task struct of the current thread?
> > Or might this return a pointer some other thread's task struct?
> >
> >> do {
> >> printk("%d\n", t->pid);
> >> } while_each_thread(g, t);
> >>
> >> rcu_read_unlock();
> >>
> >> Whatever we do, without tasklist/siglock this can obviously race
> >> with fork/exit/exec. It is OK to miss a thread, or print the pid
> >> of the already exited/released task.
> >>
> >> But it should not loop forever (the subject), and it should not
> >> print the same pid twice (ignoring pid reuse, of course).
> >>
> >> And, afaics, there are no problems with rcu magic per se, next_thread()
> >> always returns the task_struct we can safely dereference. The only
> >> problem is that while_each_thread() assumes that sooner or later
> >> next_thread() must reach the starting point, g.
> >>
> >> (zap_threads() is different, it must not miss a thread with ->mm
> >> we are going to dump, but it holds mmap_sem).
> >
> > Indeed, the tough part is figuring out when you are done given that things
> > can come and go at will. Some additional tricks, in no particular order:
> >
> > 1. Always start at the group leader. Of course, the group leader
> > is probably permitted to leave any time it wants to, so this
> > is not sufficient in and of itself.
>
> No. The group leader must exist as long as the group exists.
> Modulo de_thread weirdness. The group_leader can be a zombie but
> it can not go away completely.
Ah, OK -- now that you mention it, all the thinks that I can think of
that remove a thread from a group have the side effect of destroying
the old group (exec() and exit()). Other things that create a new thread
group leave the old thread group intact.
Or am I forgetting some odd operation?
> > 2. Maintain a separate task structure that flags the head of the
> > list. This separate structure is freed one RCU grace period
> > following the disappearance of the current group leader. This
> > should be quite robust, but "holy overhead, Batman!!!" (Apologies
> > for the American pop culture reference, but nothing else seemed
> > appropriate.)
>
> That is roughly what we have in the group leader right now.
But can't the group leader do an exec(), becoming the leader of a new
thread group without waiting for a grace period? Or this possibility
already covered?
Thanx, Paul
On 06/21, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> >> If
> >> that's so, then just changing it to avoid the situation seems like it
> >> would be less invasive overall.
> >
> > How? We should change ->group_leader uner write_lock_irq(tasklist),
> > synchronize_rcu() is not an option. We can't do call_rcu(release_task),
> > we can't take tasklist for writing in the softirq context. But even
> > if we could, this can't help in fact or I missed something.
>
> We already do: call_rcu(&p->rcu, delayed_put_task_struct); in release_task.
> We don't call release_task until after we have removed it as leader and
> dropped the write lock.
Yes. I meant that while this is needed to ensure that next_thread/etc
returns the rcu-safe data, this (or more rcu_call's) can help to fix
while_each_thread.
I think I was unclear. de_thread() changes ->group_leader, but this does
not matter at all. I mentioned this only because we discussed the possibility
to check ->group_leader in while_each_thread.
What does matter is the single line in __unhash_process()
list_del_rcu(&p->thread_group);
this breaks while_each_thread().
IOW. Why list_for_each_rcu(head) actually works? It works because this
head itself can't be removed from list.
while_each_thread(g, t) is almost equal to list_for_each_rcu() and it
can only work if g can't be removed from list (OK, strictly speaking
until other sub-threads go away, but this doesn't matter).
However, g can be removed if a) it is not ->group_leader and exits,
or b) its subthread execs.
> At first glance it sounds like the group leader is safe as a stopping
> point for a rcu while_each_thread, and I expect the fact that
> de_thread takes everything down to a single thread, could have nice
> properties here. If pid_alive were only to fail on the group leader
> when de_thread is called I think we could legitimately say that an event
> we won't worry about. It is close enough to a new thread being created
> anyway.
Not sure I understand exactly... I mean, I don't understand whether
you agree or not with the fix which adds pid_alive() check into
next_thread_careful().
Oleg.
On 06/21, Paul E. McKenney wrote:
>
> Indeed, the tough part is figuring out when you are done given that things
> can come and go at will. Some additional tricks, in no particular order:
>
> 1. Always start at the group leader.
We can't. We have users which start at the arbitrary thread.
> 2. Maintain a separate task structure that flags the head of the
> list. This separate structure is freed one RCU grace period
> following the disappearance of the current group leader.
Even simpler, we can just add list_head into signal_struct. I thought
about this, but this breaks thread_group_empty (this is fixeable) and,
again, I'd like very much to avoid adding new fields into task_struct
or signal_struct.
> > Well, another field in task_struct...
>
> Yeah, would be good to avoid this. Not sure it can be avoided, though.
Why? I think next_thread_careful() from
http://marc.info/?l=linux-kernel&m=127714242731448
should work.
If the caller holds tasklist or siglock, this change has no effect.
If the caller does while_each_thread() under rcu_read_lock(), then
it is OK to break the loop earlier than we do now. The lockless
while_each_thread() works in a "best effort" manner anyway, if it
races with exit_group() or exec() it can miss some/most/all sub-threads
(including the new leader) with or without this change.
Yes, zap_threads() needs additional fixes. But I think it is better
to complicate a couple of lockless callers (or just change them
to take tasklist) which must not miss an "interesting" thread.
> > > o Do the de_thread() incrementally. So if the list is tasks A,
> > > B, and C, in that order, and if we are de-thread()ing B,
> > > then make A's pointer refer to C,
> >
> > This breaks while_each_thread() under tasklist/siglock. It must
> > see all unhashed tasks.
>
> Could de_thread() hold those locks in order to avoid that breakage?
How can it hold, say, siglock? We need to wait a grace period.
To clarify. de_thread() kills all threads except the group_leader,
so we have only 2 threads: group_leader A and B.
If we add synchronize_rcu() before release_task(leader) (as Roland
suggested), then we don't need to change A's pointer. This probably
fixes while_each_thread() in the common case. But this disallows
the tricks like rcu_lock_break().
And. Whatever we do with de_thread(), this can't fix the lockless
while_each_thread(not_a_group_leader, t). I do not know if there is
any user which does this though.
fastpath_timer_check()->thread_group_cputimer() does this, but this
is wrong and we already have the patch which removes it.
Oleg.
On Tue, Jun 22, 2010 at 11:23:57PM +0200, Oleg Nesterov wrote:
> On 06/21, Paul E. McKenney wrote:
> >
> > Indeed, the tough part is figuring out when you are done given that things
> > can come and go at will. Some additional tricks, in no particular order:
> >
> > 1. Always start at the group leader.
>
> We can't. We have users which start at the arbitrary thread.
OK.
> > 2. Maintain a separate task structure that flags the head of the
> > list. This separate structure is freed one RCU grace period
> > following the disappearance of the current group leader.
>
> Even simpler, we can just add list_head into signal_struct. I thought
> about this, but this breaks thread_group_empty (this is fixeable) and,
> again, I'd like very much to avoid adding new fields into task_struct
> or signal_struct.
Understood.
> > > Well, another field in task_struct...
> >
> > Yeah, would be good to avoid this. Not sure it can be avoided, though.
>
> Why? I think next_thread_careful() from
> http://marc.info/?l=linux-kernel&m=127714242731448
> should work.
>
> If the caller holds tasklist or siglock, this change has no effect.
>
> If the caller does while_each_thread() under rcu_read_lock(), then
> it is OK to break the loop earlier than we do now. The lockless
> while_each_thread() works in a "best effort" manner anyway, if it
> races with exit_group() or exec() it can miss some/most/all sub-threads
> (including the new leader) with or without this change.
>
> Yes, zap_threads() needs additional fixes. But I think it is better
> to complicate a couple of lockless callers (or just change them
> to take tasklist) which must not miss an "interesting" thread.
Is it the case that creating a new group leader from an existing group
always destroys the old group? It certainly is the case for exec().
In my earlier emails, I was assuming that it was possible to create a
new thread group without destroying the old one, and that the thread
group leader might leave the thread group and form a new one, so that a
new thread group leader would be selected for the old group. I suspect
that I was confused. ;-)
Anyway, if creating a new thread group implies destroying the old one,
and if the thread group leader cannot be concurrently creating a new
thread group and traversing the old one, then yes, I do believe your
code at http://marc.info/?l=linux-kernel&m=127714242731448 will work.
Assuming that the call to next_thread_careful(t) in the definition of
while_each_thread() is replaced with next_thread_careful(g,t).
And give or take memory barriers.
The implied memory barrier mentioned in the comment in your example code
is the spin_lock_irqsave() and spin_unlock_irqrestore() in free_pid(),
which is called from __change_pid() which is called from detach_pid()?
One some platforms, code may be reordered from both sides into the
resulting critical section, so you actually need two non-overlapping
lock-based critical sections to guarantee full memory-barrier semantics.
And the atomic_inc() in free_pidmap() is not required to provide
memory-barrier semantics, and does not do so on all platforms.
Or does the implied memory barrier correspond to the first of three calls
to detach_pid() in __unhash_process()? (The above analysis assumes that
it corresponds to the last of the three.)
> > > > o Do the de_thread() incrementally. So if the list is tasks A,
> > > > B, and C, in that order, and if we are de-thread()ing B,
> > > > then make A's pointer refer to C,
> > >
> > > This breaks while_each_thread() under tasklist/siglock. It must
> > > see all unhashed tasks.
> >
> > Could de_thread() hold those locks in order to avoid that breakage?
>
> How can it hold, say, siglock? We need to wait a grace period.
> To clarify. de_thread() kills all threads except the group_leader,
> so we have only 2 threads: group_leader A and B.
>
> If we add synchronize_rcu() before release_task(leader) (as Roland
> suggested), then we don't need to change A's pointer. This probably
> fixes while_each_thread() in the common case. But this disallows
> the tricks like rcu_lock_break().
>
>
> And. Whatever we do with de_thread(), this can't fix the lockless
> while_each_thread(not_a_group_leader, t). I do not know if there is
> any user which does this though.
> fastpath_timer_check()->thread_group_cputimer() does this, but this
> is wrong and we already have the patch which removes it.
Indeed. Suppose that the starting task is the one immediately preceding
the task group leader. You get a pointer to the task in question
and traverse to the next task (the task group leader), during which
time the thread group leader does exec() and maybe a pthread_create()
or two. Oops! You are now now traversing the wrong thread group!
There are ways of fixing this, but all the ones I know of require more
fields in the task structure, so best if we don't need to start somewhere
other than a group leader.
Thanx, Paul
On 06/22, Paul E. McKenney wrote:
>
> On Tue, Jun 22, 2010 at 11:23:57PM +0200, Oleg Nesterov wrote:
> > On 06/21, Paul E. McKenney wrote:
> > >
> > > Yeah, would be good to avoid this. Not sure it can be avoided, though.
> >
> > Why? I think next_thread_careful() from
> > http://marc.info/?l=linux-kernel&m=127714242731448
> > should work.
> >
> > If the caller holds tasklist or siglock, this change has no effect.
> >
> > If the caller does while_each_thread() under rcu_read_lock(), then
> > it is OK to break the loop earlier than we do now. The lockless
> > while_each_thread() works in a "best effort" manner anyway, if it
> > races with exit_group() or exec() it can miss some/most/all sub-threads
> > (including the new leader) with or without this change.
> >
> > Yes, zap_threads() needs additional fixes. But I think it is better
> > to complicate a couple of lockless callers (or just change them
> > to take tasklist) which must not miss an "interesting" thread.
>
> Is it the case that creating a new group leader from an existing group
> always destroys the old group? It certainly is the case for exec().
Yes. And only exec() can change the leader.
> Anyway, if creating a new thread group implies destroying the old one,
> and if the thread group leader cannot be concurrently creating a new
> thread group and traversing the old one, then yes, I do believe your
> code at http://marc.info/?l=linux-kernel&m=127714242731448 will work.
>
> Assuming that the call to next_thread_careful(t) in the definition of
> while_each_thread() is replaced with next_thread_careful(g,t).
Great.
> And give or take memory barriers.
>
> The implied memory barrier mentioned in the comment in your example code
> is the spin_lock_irqsave() and spin_unlock_irqrestore()
Argh. No, no, no. I meant unlock + lock, not lock + unlock. But when
I look at __change_pid() now I do not see the 2nd lock() after
unlock(pidmap_lock). I was wrong, thanks.
And, even if I was correct it is not nice to rely on the internals
of detach_pid(). If we use next_thread_careful(), __unhash_process()
needs wmb.
Thanks!
> > And. Whatever we do with de_thread(), this can't fix the lockless
> > while_each_thread(not_a_group_leader, t). I do not know if there is
> > any user which does this though.
> > fastpath_timer_check()->thread_group_cputimer() does this, but this
> > is wrong and we already have the patch which removes it.
>
> Indeed. Suppose that the starting task is the one immediately preceding
> the task group leader. You get a pointer to the task in question
> and traverse to the next task (the task group leader), during which
> time the thread group leader does exec() and maybe a pthread_create()
> or two. Oops! You are now now traversing the wrong thread group!
Yes, but there is another more simple and much more feasible race.
while_each_thread(non_leader, t) will loop forever if non_leader
simply exits and passes __unhash_process(). After that next_thread(t)
can never return non_leader.
> There are ways of fixing this, but all the ones I know of require more
> fields in the task structure,
Just in case, I hope that next_thread_careful() handles this case too.
> so best if we don't need to start somewhere
> other than a group leader.
(I assume, you mean the lockless case)
I am not sure. Because it is not trivial to enforce this rule even if
we add a fat comment. Please look at check_hung_uninterruptible_tasks().
It does do_each_thread/while_each_thread and thus it always starts at
the leader. But in fact this is not true due to rcu_lock_break().
Or proc_task_readdir(). It finds the leader, does get_task_struct(leader)
and drops rcu lock. After that first_tid() takes rcu_read_lock() again
and does the open coded while_each_thread(). At first glance this case
looks fine, "for (pos = leader; nr > 0; --nr)" can't loop forever.
But in fact it is not:
- proc_task_readdir() finds the leader L, does get_task_struct()
and drops rcu lock.
Suppose that filp->f_pos >= 2.
Suppose also that the previous tid cached in filp->f_version
is no longer valid.
The caller is preempted.
- a sub-thread T does exec, and does release_task(L).
But L->thread_group->next == T, so far everything is good
- T spawns other sub-threads (so that get_nr_threads() > nr)
and exits.
- grace period passes, T's task_struct is freed/unmapped/reused
- proc_task_readdir() resumes and calls first_tid(L).
next_thread(L) returns T == nowhere
It is very possible that I missed something here, my only point is
that I think it would be safer to assume nothing about the leaderness.
Oleg.
On Wed, Jun 23, 2010 at 05:24:21PM +0200, Oleg Nesterov wrote:
> On 06/22, Paul E. McKenney wrote:
> >
> > On Tue, Jun 22, 2010 at 11:23:57PM +0200, Oleg Nesterov wrote:
> > > On 06/21, Paul E. McKenney wrote:
> > > >
> > > > Yeah, would be good to avoid this. Not sure it can be avoided, though.
> > >
> > > Why? I think next_thread_careful() from
> > > http://marc.info/?l=linux-kernel&m=127714242731448
> > > should work.
> > >
> > > If the caller holds tasklist or siglock, this change has no effect.
> > >
> > > If the caller does while_each_thread() under rcu_read_lock(), then
> > > it is OK to break the loop earlier than we do now. The lockless
> > > while_each_thread() works in a "best effort" manner anyway, if it
> > > races with exit_group() or exec() it can miss some/most/all sub-threads
> > > (including the new leader) with or without this change.
> > >
> > > Yes, zap_threads() needs additional fixes. But I think it is better
> > > to complicate a couple of lockless callers (or just change them
> > > to take tasklist) which must not miss an "interesting" thread.
> >
> > Is it the case that creating a new group leader from an existing group
> > always destroys the old group? It certainly is the case for exec().
>
> Yes. And only exec() can change the leader.
>
> > Anyway, if creating a new thread group implies destroying the old one,
> > and if the thread group leader cannot be concurrently creating a new
> > thread group and traversing the old one, then yes, I do believe your
> > code at http://marc.info/?l=linux-kernel&m=127714242731448 will work.
> >
> > Assuming that the call to next_thread_careful(t) in the definition of
> > while_each_thread() is replaced with next_thread_careful(g,t).
>
> Great.
>
> > And give or take memory barriers.
> >
> > The implied memory barrier mentioned in the comment in your example code
> > is the spin_lock_irqsave() and spin_unlock_irqrestore()
>
> Argh. No, no, no. I meant unlock + lock, not lock + unlock. But when
> I look at __change_pid() now I do not see the 2nd lock() after
> unlock(pidmap_lock). I was wrong, thanks.
>
> And, even if I was correct it is not nice to rely on the internals
> of detach_pid(). If we use next_thread_careful(), __unhash_process()
> needs wmb.
>
> Thanks!
>
> > > And. Whatever we do with de_thread(), this can't fix the lockless
> > > while_each_thread(not_a_group_leader, t). I do not know if there is
> > > any user which does this though.
> > > fastpath_timer_check()->thread_group_cputimer() does this, but this
> > > is wrong and we already have the patch which removes it.
> >
> > Indeed. Suppose that the starting task is the one immediately preceding
> > the task group leader. You get a pointer to the task in question
> > and traverse to the next task (the task group leader), during which
> > time the thread group leader does exec() and maybe a pthread_create()
> > or two. Oops! You are now now traversing the wrong thread group!
>
> Yes, but there is another more simple and much more feasible race.
> while_each_thread(non_leader, t) will loop forever if non_leader
> simply exits and passes __unhash_process(). After that next_thread(t)
> can never return non_leader.
>
> > There are ways of fixing this, but all the ones I know of require more
> > fields in the task structure,
>
> Just in case, I hope that next_thread_careful() handles this case too.
>
> > so best if we don't need to start somewhere
> > other than a group leader.
>
> (I assume, you mean the lockless case)
>
> I am not sure. Because it is not trivial to enforce this rule even if
> we add a fat comment. Please look at check_hung_uninterruptible_tasks().
> It does do_each_thread/while_each_thread and thus it always starts at
> the leader. But in fact this is not true due to rcu_lock_break().
>
> Or proc_task_readdir(). It finds the leader, does get_task_struct(leader)
> and drops rcu lock. After that first_tid() takes rcu_read_lock() again
> and does the open coded while_each_thread(). At first glance this case
> looks fine, "for (pos = leader; nr > 0; --nr)" can't loop forever.
> But in fact it is not:
>
> - proc_task_readdir() finds the leader L, does get_task_struct()
> and drops rcu lock.
>
> Suppose that filp->f_pos >= 2.
> Suppose also that the previous tid cached in filp->f_version
> is no longer valid.
>
> The caller is preempted.
>
> - a sub-thread T does exec, and does release_task(L).
>
> But L->thread_group->next == T, so far everything is good
>
> - T spawns other sub-threads (so that get_nr_threads() > nr)
> and exits.
>
> - grace period passes, T's task_struct is freed/unmapped/reused
>
> - proc_task_readdir() resumes and calls first_tid(L).
>
> next_thread(L) returns T == nowhere
>
> It is very possible that I missed something here, my only point is
> that I think it would be safer to assume nothing about the leaderness.
It is past time that I list out my assumptions more carefully. ;-)
First, what "bad things" can happen to a reader scanning a thread
group?
1. The thread-group leader might do exec(), destroying the old
list and forming a new one. In this case, we want any readers
to stop scanning.
2. Some other thread might do exec(), destroying the old list and
forming a new one. In this case, we also want any readers to
stop scanning.
3. The thread-group leader might do pthread_exit(), removing itself
from the thread group -- and might do so while the hapless reader
is referencing that thread.
But isn't this prohibited? Or is it really legal to do a
pthread_create() to create a new thread and then have the
parent thread call pthread_exit()? Not something I would
consider trying in my own code! Well, I might, just to
be perverse, but... ;-)
4. Some other thread might do pthread_exit(), removing itself
from the thread group, and again might do so while the hapless
reader is referencing that thread. In this case, we want
the hapless reader to continue scanning the remainder of the
thread group.
5. The thread-group leader might do exit(), destroying the old
list without forming a new one. In this case, we want any
readers to stop scanning.
6. Some other thread might do exit(), destroying the old list
without forming a new one. In this case, we also want any
readers to stop scanning.
Anything else I might be missing?
And proc_task_readdir() does look quite interesting, and I don't claim
to understand it yet.
Thanx, Paul
On 06/24/2010 12:07 PM, Paul E. McKenney wrote:
> 3. The thread-group leader might do pthread_exit(), removing itself
> from the thread group -- and might do so while the hapless reader
> is referencing that thread.
>
> But isn't this prohibited? Or is it really legal to do a
> pthread_create() to create a new thread and then have the
> parent thread call pthread_exit()? Not something I would
> consider trying in my own code! Well, I might, just to
> be perverse, but... ;-)
I believe SUS allows the main thread to explicitly call pthread_exit(),
leaving the other threads to run. If the main() routine just returns
then it implicitly calls exit().
Chris
--
Chris Friesen
Software Developer
GENBAND
[email protected]
http://www.genband.com
> First, what "bad things" can happen to a reader scanning a thread
> group?
>
> 1. The thread-group leader might do exec(), destroying the old
> list and forming a new one. In this case, we want any readers
> to stop scanning.
This doesn't do anything different (for these concerns) from just all the
other threads happening to exit right before the exec. There is no
"destroying the old" and "forming the new", it's just that all the other
threads are convinced to die now. There is no problem here.
> 2. Some other thread might do exec(), destroying the old list and
> forming a new one. In this case, we also want any readers to
> stop scanning.
Again, the list is not really destroyed, just everybody dies. What is
different here is that ->group_leader changes. This is the only time
that ever happens. Moreover, it's the only time that a task that was
previously pointed to by any ->group_leader can be reaped before the
rest of the group has already been reaped first (and thus the
thread_group made a singleton).
> 3. The thread-group leader might do pthread_exit(), removing itself
> from the thread group -- and might do so while the hapless reader
> is referencing that thread.
This is called the delay_group_leader() case. It doesn't happen in a
way that has the problems you are concerned with. The group_leader
remains in EXIT_ZOMBIE state and can't be reaped until all the other
threads have been reaped. There is no time at which any thread in the
group is in any hashes or accessible by any means after the (final)
group_leader is reaped.
> 4. Some other thread might do pthread_exit(), removing itself
> from the thread group, and again might do so while the hapless
> reader is referencing that thread. In this case, we want
> the hapless reader to continue scanning the remainder of the
> thread group.
This is the most normal case (and #1 is effectively just this repeated
by every thread in parallel).
> 5. The thread-group leader might do exit(), destroying the old
> list without forming a new one. In this case, we want any
> readers to stop scanning.
All this means is everybody is convinced to die, and the group_leader
dies too. It is not discernibly different from #6.
> 6. Some other thread might do exit(), destroying the old list
> without forming a new one. In this case, we also want any
> readers to stop scanning.
This just means everybody is convinced to die and is not materially
different from each individual thread all happening to die at the same
time.
You've described all these cases as "we want any readers to stop
scanning". That is far too strong, and sounds like some kind of
guaranteed synchronization, which does not and need not exist. Any
reader that needs a dead thread to be off the list holds siglock
and/or tasklist_lock. For the casual readers that only use
rcu_read_lock, we only "want any readers' loops eventually to
terminate and never to dereference stale pointers". That's why
normal RCU listiness is generally fine.
The only problem we have is in #2. This is only a problem because
readers' loops may be using the old ->group_leader pointer as the
anchor for their circular-list round-robin loop. Once the former
leader is removed from the list, that loop termination condition can
never be met.
Thanks,
Roland
On 06/24, Paul E. McKenney wrote:
>
> On Wed, Jun 23, 2010 at 05:24:21PM +0200, Oleg Nesterov wrote:
> > It is very possible that I missed something here, my only point is
> > that I think it would be safer to assume nothing about the leaderness.
>
> It is past time that I list out my assumptions more carefully. ;-)
>
> First, what "bad things" can happen to a reader scanning a thread
> group?
(I assume you mean the lockless case)
Currently, the only bad thing is that while_each_thread(g) can loop
forever if we race with exec(), or exit() if g is not leader.
And, to simplify, let's consider the same example again
t = g;
do {
printk("pid %d\n", t->pid);
} while_each_thread(g, t);
> 1. The thread-group leader might do exec(), destroying the old
> list and forming a new one. In this case, we want any readers
> to stop scanning.
I'd say, it is not that we want to stop scanning, it is OK to stop
scanning after we printed g->pid
> 2. Some other thread might do exec(), destroying the old list and
> forming a new one. In this case, we also want any readers to
> stop scanning.
The same.
If the code above runs under for_each_process(g) or it did
"g = find_task_by_pid(tgid)", we will see either new or old leader
and print its pid at least.
> 3. The thread-group leader might do pthread_exit(), removing itself
> from the thread group
No. It can exit, but it won't be removed from thread group. It will
be zombie untill all sub-threads disappear.
> 4. Some other thread might do pthread_exit(), removing itself
> from the thread group, and again might do so while the hapless
> reader is referencing that thread. In this case, we want
> the hapless reader to continue scanning the remainder of the
> thread group.
Yes.
But, if that thread was used as a starting point g, then
before the patch: loop forever
after the patch: break
> 5. The thread-group leader might do exit(), destroying the old
> list without forming a new one. In this case, we want any
> readers to stop scanning.
>
> 6. Some other thread might do exit(), destroying the old list
> without forming a new one. In this case, we also want any
> readers to stop scanning.
Yes. But again, it is fine to print more pids as far as we know it
is safe to iterate over the exiting thread group. However,
next_thread_careful() can stop earlier compared to next_thread().
Either way, we can miss none/some/most/all threads if we race with
exit_group().
> Anything else I might be missing?
I think this is all.
Oleg.
On 06/24, Chris Friesen wrote:
>
> On 06/24/2010 12:07 PM, Paul E. McKenney wrote:
>
> > 3. The thread-group leader might do pthread_exit(), removing itself
> > from the thread group -- and might do so while the hapless reader
> > is referencing that thread.
> >
> > But isn't this prohibited? Or is it really legal to do a
> > pthread_create() to create a new thread and then have the
> > parent thread call pthread_exit()? Not something I would
> > consider trying in my own code! Well, I might, just to
> > be perverse, but... ;-)
>
> I believe SUS allows the main thread to explicitly call pthread_exit(),
> leaving the other threads to run. If the main() routine just returns
> then it implicitly calls exit().
Correct.
But, to clarify, if the main thread does pthread_exit() (sys_exit,
actually), it won't be removed from the group. It will be zombie
until all other threads exit.
Oleg.
Oleg Nesterov <[email protected]> writes:
> On 06/24, Chris Friesen wrote:
>>
>> On 06/24/2010 12:07 PM, Paul E. McKenney wrote:
>>
>> > 3. The thread-group leader might do pthread_exit(), removing itself
>> > from the thread group -- and might do so while the hapless reader
>> > is referencing that thread.
>> >
>> > But isn't this prohibited? Or is it really legal to do a
>> > pthread_create() to create a new thread and then have the
>> > parent thread call pthread_exit()? Not something I would
>> > consider trying in my own code! Well, I might, just to
>> > be perverse, but... ;-)
>>
>> I believe SUS allows the main thread to explicitly call pthread_exit(),
>> leaving the other threads to run. If the main() routine just returns
>> then it implicitly calls exit().
>
> Correct.
>
> But, to clarify, if the main thread does pthread_exit() (sys_exit,
> actually), it won't be removed from the group. It will be zombie
> until all other threads exit.
That we don't cleanup that zombie leaders is unfortunate really, it
means we have the entire de_thread special case. But short fixing
libpthread to not make bad assumptions there is little we can do about
it really.
I'm only half following this conversation.
If what we are looking for is a stable list_head that won't disappear
on us we should be able to put one in sighand_struct or signal_struct
(I forget which is which at the moment) and have a list_head that
lives for the life of the longest living thread, and that won't get
messed up by things like de_thread, and then next_thread could simply
return NULL when we hit the end of the list.
Eric
On Thu, Jun 24, 2010 at 02:14:46PM -0700, Roland McGrath wrote:
> > First, what "bad things" can happen to a reader scanning a thread
> > group?
> >
> > 1. The thread-group leader might do exec(), destroying the old
> > list and forming a new one. In this case, we want any readers
> > to stop scanning.
>
> This doesn't do anything different (for these concerns) from just all the
> other threads happening to exit right before the exec. There is no
> "destroying the old" and "forming the new", it's just that all the other
> threads are convinced to die now. There is no problem here.
Understood -- I wasn't saying that each category posed a unique problem,
but rather making sure that I really had enumerated all the possibilities.
The reason for my "destroying the old" and "forming the new" is the
possibility of someone doing proc_task_readdir() when the group leader
does exec(), which causes all to die, and then the new process does
pthread_create(), forming a new thread group. Because proc_task_readdir()
neither holds a lock nor stays in an RCU read-side critical section
for the full /proc scan, the old group might really be destroyed from
the reader's point of view.
That said, I freely admit that I am not very familiar with this code.
> > 2. Some other thread might do exec(), destroying the old list and
> > forming a new one. In this case, we also want any readers to
> > stop scanning.
>
> Again, the list is not really destroyed, just everybody dies. What is
> different here is that ->group_leader changes. This is the only time
> that ever happens. Moreover, it's the only time that a task that was
> previously pointed to by any ->group_leader can be reaped before the
> rest of the group has already been reaped first (and thus the
> thread_group made a singleton).
Yep! Same proc_task_readdir() situation as before. The group leader
cannot go away because proc_task_readdir() takes a reference.
> > 3. The thread-group leader might do pthread_exit(), removing itself
> > from the thread group -- and might do so while the hapless reader
> > is referencing that thread.
>
> This is called the delay_group_leader() case. It doesn't happen in a
> way that has the problems you are concerned with. The group_leader
> remains in EXIT_ZOMBIE state and can't be reaped until all the other
> threads have been reaped. There is no time at which any thread in the
> group is in any hashes or accessible by any means after the (final)
> group_leader is reaped.
OK, good to know -- that does make things simpler.
> > 4. Some other thread might do pthread_exit(), removing itself
> > from the thread group, and again might do so while the hapless
> > reader is referencing that thread. In this case, we want
> > the hapless reader to continue scanning the remainder of the
> > thread group.
>
> This is the most normal case (and #1 is effectively just this repeated
> by every thread in parallel).
Agreed. One possible difference is that in #1, no one is going to
complain about the reader quitting, while in this case someone might
well be annoyed if the list of threads is incomplete.
> > 5. The thread-group leader might do exit(), destroying the old
> > list without forming a new one. In this case, we want any
> > readers to stop scanning.
>
> All this means is everybody is convinced to die, and the group_leader
> dies too. It is not discernibly different from #6.
Seems reasonable.
> > 6. Some other thread might do exit(), destroying the old list
> > without forming a new one. In this case, we also want any
> > readers to stop scanning.
>
> This just means everybody is convinced to die and is not materially
> different from each individual thread all happening to die at the same
> time.
Fair enough. Again, my goal was to ensure that I had covered all the
cases as opposed to ensuring that I had described them minimally.
> You've described all these cases as "we want any readers to stop
> scanning". That is far too strong, and sounds like some kind of
> guaranteed synchronization, which does not and need not exist. Any
> reader that needs a dead thread to be off the list holds siglock
> and/or tasklist_lock. For the casual readers that only use
> rcu_read_lock, we only "want any readers' loops eventually to
> terminate and never to dereference stale pointers". That's why
> normal RCU listiness is generally fine.
OK, so maybe "it is OK for readers to stop scanning" is a better way
of putting it?
> The only problem we have is in #2. This is only a problem because
> readers' loops may be using the old ->group_leader pointer as the
> anchor for their circular-list round-robin loop. Once the former
> leader is removed from the list, that loop termination condition can
> never be met.
Does Oleg's checking for the group leader still being alive look correct
to you?
Thanx, Paul
On Thu, Jun 24, 2010 at 11:57:02PM +0200, Oleg Nesterov wrote:
> On 06/24, Paul E. McKenney wrote:
> >
> > On Wed, Jun 23, 2010 at 05:24:21PM +0200, Oleg Nesterov wrote:
> > > It is very possible that I missed something here, my only point is
> > > that I think it would be safer to assume nothing about the leaderness.
> >
> > It is past time that I list out my assumptions more carefully. ;-)
> >
> > First, what "bad things" can happen to a reader scanning a thread
> > group?
>
> (I assume you mean the lockless case)
You are quite right -- I should have stated that explicitly.
> Currently, the only bad thing is that while_each_thread(g) can loop
> forever if we race with exec(), or exit() if g is not leader.
>
> And, to simplify, let's consider the same example again
>
> t = g;
> do {
> printk("pid %d\n", t->pid);
> } while_each_thread(g, t);
>
>
> > 1. The thread-group leader might do exec(), destroying the old
> > list and forming a new one. In this case, we want any readers
> > to stop scanning.
>
> I'd say, it is not that we want to stop scanning, it is OK to stop
> scanning after we printed g->pid
Fair enough.
> > 2. Some other thread might do exec(), destroying the old list and
> > forming a new one. In this case, we also want any readers to
> > stop scanning.
>
> The same.
>
> If the code above runs under for_each_process(g) or it did
> "g = find_task_by_pid(tgid)", we will see either new or old leader
> and print its pid at least.
OK.
> > 3. The thread-group leader might do pthread_exit(), removing itself
> > from the thread group
>
> No. It can exit, but it won't be removed from thread group. It will
> be zombie untill all sub-threads disappear.
This does make things easier! Whew!!! ;-)
> > 4. Some other thread might do pthread_exit(), removing itself
> > from the thread group, and again might do so while the hapless
> > reader is referencing that thread. In this case, we want
> > the hapless reader to continue scanning the remainder of the
> > thread group.
>
> Yes.
>
> But, if that thread was used as a starting point g, then
>
> before the patch: loop forever
> after the patch: break
So it is OK to skip some of the other threads in this case, even
though they were present throughout the whole procedure?
> > 5. The thread-group leader might do exit(), destroying the old
> > list without forming a new one. In this case, we want any
> > readers to stop scanning.
> >
> > 6. Some other thread might do exit(), destroying the old list
> > without forming a new one. In this case, we also want any
> > readers to stop scanning.
>
> Yes. But again, it is fine to print more pids as far as we know it
> is safe to iterate over the exiting thread group. However,
> next_thread_careful() can stop earlier compared to next_thread().
> Either way, we can miss none/some/most/all threads if we race with
> exit_group().
Yes, if there is an exit(), it makes sense that you might not see all
of the threads -- they could reasonably have disappeared before you
got done listing them.
> > Anything else I might be missing?
>
> I think this is all.
OK, thank you (and Roland) for the tutorial!
Thanx, Paul
On Thu, Jun 24, 2010 at 05:08:10PM -0700, Eric W. Biederman wrote:
> Oleg Nesterov <[email protected]> writes:
>
> > On 06/24, Chris Friesen wrote:
> >>
> >> On 06/24/2010 12:07 PM, Paul E. McKenney wrote:
> >>
> >> > 3. The thread-group leader might do pthread_exit(), removing itself
> >> > from the thread group -- and might do so while the hapless reader
> >> > is referencing that thread.
> >> >
> >> > But isn't this prohibited? Or is it really legal to do a
> >> > pthread_create() to create a new thread and then have the
> >> > parent thread call pthread_exit()? Not something I would
> >> > consider trying in my own code! Well, I might, just to
> >> > be perverse, but... ;-)
> >>
> >> I believe SUS allows the main thread to explicitly call pthread_exit(),
> >> leaving the other threads to run. If the main() routine just returns
> >> then it implicitly calls exit().
> >
> > Correct.
> >
> > But, to clarify, if the main thread does pthread_exit() (sys_exit,
> > actually), it won't be removed from the group. It will be zombie
> > until all other threads exit.
>
> That we don't cleanup that zombie leaders is unfortunate really, it
> means we have the entire de_thread special case. But short fixing
> libpthread to not make bad assumptions there is little we can do about
> it really.
Keeping the zombie leaders does make at least one of the lockless
scan cases quite a bit simpler. I think, anyway.
> I'm only half following this conversation.
>
> If what we are looking for is a stable list_head that won't disappear
> on us we should be able to put one in sighand_struct or signal_struct
> (I forget which is which at the moment) and have a list_head that
> lives for the life of the longest living thread, and that won't get
> messed up by things like de_thread, and then next_thread could simply
> return NULL when we hit the end of the list.
Oleg did suggest this possibility, but there were complications that
I do not claim to fully understand.
Thanx, Paul
On 06/24, Paul E. McKenney wrote:
>
> On Thu, Jun 24, 2010 at 11:57:02PM +0200, Oleg Nesterov wrote:
> > On 06/24, Paul E. McKenney wrote:
> > >
> > > 4. Some other thread might do pthread_exit(), removing itself
> > > from the thread group, and again might do so while the hapless
> > > reader is referencing that thread. In this case, we want
> > > the hapless reader to continue scanning the remainder of the
> > > thread group.
> >
> > Yes.
> >
> > But, if that thread was used as a starting point g, then
> >
> > before the patch: loop forever
> > after the patch: break
>
> So it is OK to skip some of the other threads in this case, even
> though they were present throughout the whole procedure?
I think, yes. We can miss them in any case, they can go away before
while_each_thread(g, t) starts the scan.
If g == group_leader (old or new), then we should notice this thread
at least.
Otherwise we can miss them all, with or without next_thread_careful().
Oleg.
On 06/24, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> If what we are looking for is a stable list_head that won't disappear
> on us we should be able to put one in sighand_struct or signal_struct
> (I forget which is which at the moment) and have a list_head that
> lives for the life of the longest living thread, and that won't get
> messed up by things like de_thread, and then next_thread could simply
> return NULL when we hit the end of the list.
This was already discussed.
Yes, we can add list_head into signal_struct. But this breaks, say,
thread_group_empty() and the similar code. This is fixeable.
But. We need to convert the code which does while_each_thread(g, t) to
use list_for_each(t, head). And, if g != group_leader this can't work.
Not to mention the increase of sizeof(signal_struct).
Oleg.
On Fri, Jun 25, 2010 at 11:55:48AM +0200, Oleg Nesterov wrote:
> On 06/24, Paul E. McKenney wrote:
> >
> > On Thu, Jun 24, 2010 at 11:57:02PM +0200, Oleg Nesterov wrote:
> > > On 06/24, Paul E. McKenney wrote:
> > > >
> > > > 4. Some other thread might do pthread_exit(), removing itself
> > > > from the thread group, and again might do so while the hapless
> > > > reader is referencing that thread. In this case, we want
> > > > the hapless reader to continue scanning the remainder of the
> > > > thread group.
> > >
> > > Yes.
> > >
> > > But, if that thread was used as a starting point g, then
> > >
> > > before the patch: loop forever
> > > after the patch: break
> >
> > So it is OK to skip some of the other threads in this case, even
> > though they were present throughout the whole procedure?
>
> I think, yes. We can miss them in any case, they can go away before
> while_each_thread(g, t) starts the scan.
>
> If g == group_leader (old or new), then we should notice this thread
> at least.
>
> Otherwise we can miss them all, with or without next_thread_careful().
Just to be sure that we are actually talking about the same scenario...
Suppose that a task group is lead by 2908 and has member 2909, 2910,
2911, and 2912. Suppose that 2910 does pthread_exit() just as some
other task is "ls"ing the relevant /proc entry. Is it really OK for
"ls" to show 2909 but not 2911 and 2912, even though 2911 and 2912
were alive and kicking the entire time?
Thanx, Paul
On 06/28, Paul E. McKenney wrote:
>
> On Fri, Jun 25, 2010 at 11:55:48AM +0200, Oleg Nesterov wrote:
> > On 06/24, Paul E. McKenney wrote:
> > >
> > > So it is OK to skip some of the other threads in this case, even
> > > though they were present throughout the whole procedure?
> >
> > I think, yes. We can miss them in any case, they can go away before
> > while_each_thread(g, t) starts the scan.
> >
> > If g == group_leader (old or new), then we should notice this thread
> > at least.
> >
> > Otherwise we can miss them all, with or without next_thread_careful().
>
> Just to be sure that we are actually talking about the same scenario...
>
> Suppose that a task group is lead by 2908 and has member 2909, 2910,
> 2911, and 2912. Suppose that 2910 does pthread_exit() just as some
> other task is "ls"ing the relevant /proc entry. Is it really OK for
> "ls" to show 2909 but not 2911 and 2912, even though 2911 and 2912
> were alive and kicking the entire time?
Confused.
Let's return to
do
printk("%d\n", t->pid);
while_each_thread(g, t);
for the moment.
In that case, if g != 2910 (the exiting thread) we will print all pids,
except we can miss 2910. With or without next_thread_careful().
Only if we start at g == 2910, then
current code: print 2910, then spin forever printing
other pids
next_thread_careful: stop printing when we notice that 2910
was unhashed.
So, yes, in this case we can miss all
other threads.
As for "ls"ing the relevant /proc entry. proc_task_readdir() is complicated,
it can drop rcu lock, sleep, etc. But basically it mimics while_each_thread()
logic. Let's assume that proc_task_fill_cache() never fails.
proc_task_readdir() always starts at the group_leader, 2908. So, with or
without next_thread_careful() we can only miss the exiting 2910.
But (again, unless I missed something) the current code can race with exec,
and s/next_thread/next_thread_careful/ in first_tid() can fix the race.
(just in case, we can fix it differently).
But, of course, if you do "ls /proc/2910/task" instead of "ls /proc/2908/task"
you can miss _all_ threads if 2910 exits before proc_task_readdir() finds
its leader, 2908. Again, this is with or without next_thread_careful().
Paul, please let me know if I misunderstood your concerns, or if I missed
something.
Oleg.
On Tue, Jun 29, 2010 at 03:05:03PM +0200, Oleg Nesterov wrote:
> On 06/28, Paul E. McKenney wrote:
> >
> > On Fri, Jun 25, 2010 at 11:55:48AM +0200, Oleg Nesterov wrote:
> > > On 06/24, Paul E. McKenney wrote:
> > > >
> > > > So it is OK to skip some of the other threads in this case, even
> > > > though they were present throughout the whole procedure?
> > >
> > > I think, yes. We can miss them in any case, they can go away before
> > > while_each_thread(g, t) starts the scan.
> > >
> > > If g == group_leader (old or new), then we should notice this thread
> > > at least.
> > >
> > > Otherwise we can miss them all, with or without next_thread_careful().
> >
> > Just to be sure that we are actually talking about the same scenario...
> >
> > Suppose that a task group is lead by 2908 and has member 2909, 2910,
> > 2911, and 2912. Suppose that 2910 does pthread_exit() just as some
> > other task is "ls"ing the relevant /proc entry. Is it really OK for
> > "ls" to show 2909 but not 2911 and 2912, even though 2911 and 2912
> > were alive and kicking the entire time?
>
> Confused.
>
> Let's return to
>
> do
> printk("%d\n", t->pid);
> while_each_thread(g, t);
>
> for the moment.
>
> In that case, if g != 2910 (the exiting thread) we will print all pids,
> except we can miss 2910. With or without next_thread_careful().
>
> Only if we start at g == 2910, then
>
> current code: print 2910, then spin forever printing
> other pids
>
> next_thread_careful: stop printing when we notice that 2910
> was unhashed.
>
> So, yes, in this case we can miss all
> other threads.
>
> As for "ls"ing the relevant /proc entry. proc_task_readdir() is complicated,
> it can drop rcu lock, sleep, etc. But basically it mimics while_each_thread()
> logic. Let's assume that proc_task_fill_cache() never fails.
>
> proc_task_readdir() always starts at the group_leader, 2908. So, with or
> without next_thread_careful() we can only miss the exiting 2910.
>
> But (again, unless I missed something) the current code can race with exec,
> and s/next_thread/next_thread_careful/ in first_tid() can fix the race.
> (just in case, we can fix it differently).
>
> But, of course, if you do "ls /proc/2910/task" instead of "ls /proc/2908/task"
> you can miss _all_ threads if 2910 exits before proc_task_readdir() finds
> its leader, 2908. Again, this is with or without next_thread_careful().
>
>
> Paul, please let me know if I misunderstood your concerns, or if I missed
> something.
Thank you very much for laying this out completely! I was having a hard
time believing that it was OK to miss threads in the "ls /proc/2910/task"
case. But of course similar issues can arise when running "ls" on a
directory with lots of files that are coming and going quickly in the
meantime, I guess. And if proc_task_fill_cache() fails, we can miss
tasks as well, correct?
Given all this, I believe that your fix really does work.
Thanx, Paul
On 06/29, Paul E. McKenney wrote:
>
> On Tue, Jun 29, 2010 at 03:05:03PM +0200, Oleg Nesterov wrote:
> >
> > Paul, please let me know if I misunderstood your concerns, or if I missed
> > something.
>
> Thank you very much for laying this out completely! I was having a hard
> time believing that it was OK to miss threads in the "ls /proc/2910/task"
> case. But of course similar issues can arise when running "ls" on a
> directory with lots of files that are coming and going quickly in the
> meantime, I guess.
Yes. And again, even if 2910 is not the group leader and it is exiting,
"ls /proc/2910/task" will work because proc_task_readdir() akways starts
at 2910->group_leader == 2008.
It doesn't work only if proc_task_readdir() can't find its leader, in
this particular case this just means 2910 no longer exists, and thus
/proc/2910/ is dead even if we can still find this dentry.
> And if proc_task_fill_cache() fails, we can miss
> tasks as well, correct?
Well, yes and no.
Sure, if proc_task_fill_cache() fails we didn't reported all threads.
But if /bin/ls does readdir() again after that, proc_task_readdir()
tries to contunue starting from the last-pid-we-failed-to-report.
If there is no task with that pid, we start from the group_leader
and skip the number-of-already-reported-threads.
So, we have a lot of issues here, we can miss some thread because
"skip the number-of-already-reported-threads" can't be really accurate.
But, to clarify, this has almost nothing to do with the original problem.
Afaics, if we change first_tid() to use next_thread_careful() instead
of next_thread(), we close the pure-theoretical race with exec but that
is all. (and I am still not sure this race does exist, and even if it
does we can fix it without next_thread_careful).
> Given all this, I believe that your fix really does work.
Great. I'll send the patch once I inspect zap_threads() and
current_is_single_threaded() to figure out which changes they need.
Oleg.
> We can't do call_rcu(release_task),
> we can't take tasklist for writing in the softirq context. But even
> if we could, this can't help in fact or I missed something.
Ah, I had missed that constraint of call_rcu. (It's not mentioned in the
kerneldoc comment directly, and is only buried in a couple of brief
mentions in Documentation/RCU/.) I was thinking that would suffice because
it does it between rcu_read_lock critical sections, but I guess it actually
only does it after the last one and doesn't prevent a new one from having
started on another CPU. (And we can't use it anyway.)
Thanks,
Roland
On Thu, Jul 08, 2010 at 04:59:25PM -0700, Roland McGrath wrote:
> > We can't do call_rcu(release_task),
> > we can't take tasklist for writing in the softirq context. But even
> > if we could, this can't help in fact or I missed something.
>
> Ah, I had missed that constraint of call_rcu. (It's not mentioned in the
> kerneldoc comment directly, and is only buried in a couple of brief
> mentions in Documentation/RCU/.) I was thinking that would suffice because
> it does it between rcu_read_lock critical sections, but I guess it actually
> only does it after the last one and doesn't prevent a new one from having
> started on another CPU. (And we can't use it anyway.)
Hmmm... The kerneldoc for rcu_read_lock() looks like it says this
pretty unambiguously. Then again, that is easy for me to say, given
that I was the one who made most of this stuff up. ;-)
But it has been awhile since I looked after the kerneldoc. Please see
below for the changes I would make if left to myself.
Any other suggestions for improvement to the kerneldoc?
Thanx, Paul
------------------------------------------------------------------------
commit b12483332f1a7ee2db11f4bdc3c4e61edb608b01
Author: Paul E. McKenney <[email protected]>
Date: Thu Jul 8 17:38:59 2010 -0700
rcu: improve kerneldoc for rcu_read_lock(), call_rcu(), and synchronize_rcu()
Make it explicit that new RCU read-side critical sections that start
after call_rcu() and synchronize_rcu() start might still be running
after the end of the relevant grace period.
Signed-off-by: Paul E. McKenney <[email protected]>
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 508ebba..27b44b3 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -444,7 +444,7 @@ extern int rcu_my_thread_group_empty(void);
* until after the all the other CPUs exit their critical sections.
*
* Note, however, that RCU callbacks are permitted to run concurrently
- * with RCU read-side critical sections. One way that this can happen
+ * with new RCU read-side critical sections. One way that this can happen
* is via the following sequence of events: (1) CPU 0 enters an RCU
* read-side critical section, (2) CPU 1 invokes call_rcu() to register
* an RCU callback, (3) CPU 0 exits the RCU read-side critical section,
@@ -602,11 +602,13 @@ extern void wakeme_after_rcu(struct rcu_head *head);
/**
* call_rcu() - Queue an RCU callback for invocation after a grace period.
* @head: structure to be used for queueing the RCU updates.
- * @func: actual update function to be invoked after the grace period
+ * @func: actual callback function to be invoked after the grace period
*
- * The update function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed. RCU read-side critical
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all pre-existing RCU read-side
+ * critical sections have completed. However, the callback function
+ * might well execute concurrently with RCU read-side critical sections
+ * that started after call_rcu() was invoked. RCU read-side critical
* sections are delimited by rcu_read_lock() and rcu_read_unlock(),
* and may be nested.
*/
@@ -616,9 +618,9 @@ extern void call_rcu(struct rcu_head *head,
/**
* call_rcu_bh() - Queue an RCU for invocation after a quicker grace period.
* @head: structure to be used for queueing the RCU updates.
- * @func: actual update function to be invoked after the grace period
+ * @func: actual callback function to be invoked after the grace period
*
- * The update function will be invoked some time after a full grace
+ * The callback function will be invoked some time after a full grace
* period elapses, in other words after all currently executing RCU
* read-side critical sections have completed. call_rcu_bh() assumes
* that the read-side critical sections end on completion of a softirq
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 9906f85..63bb771 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -546,9 +546,11 @@ EXPORT_SYMBOL_GPL(call_rcu);
*
* Control will return to the caller some time after a full grace
* period has elapsed, in other words after all currently executing RCU
- * read-side critical sections have completed. RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
- * and may be nested.
+ * read-side critical sections have completed. Note, however, that
+ * upon return from synchronize_rcu(), the caller might well be executing
+ * concurrently with new RCU read-side critical sections that began while
+ * synchronize_rcu() was waiting. RCU read-side critical sections are
+ * delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested.
*/
void synchronize_rcu(void)
{
> The reason for my "destroying the old" and "forming the new" is the
> possibility of someone doing proc_task_readdir() when the group leader
> does exec(), which causes all to die, and then the new process does
> pthread_create(), forming a new thread group. Because proc_task_readdir()
> neither holds a lock nor stays in an RCU read-side critical section
> for the full /proc scan, the old group might really be destroyed from
> the reader's point of view.
I haven't tried to understand the /proc code. From the userland
perspective, there is one thing called a "process" with a single ID that
the kernel calls the TGID and everybody else calls the PID, and that
container survives execs regardless of which of its threads do them.
Listing /proc/TGID/task is the way userland (i.e. debuggers) enumerate all
the threads (e.g. for attaching them all with ptrace). It's of course
expected that threads will be coming and going, so userland expects to read
it several times, until there were no new threads in the list after it
attached all the ones from the last read (so it would know if those ones
created any more).
I can't quite tell but it sounds like you may be saying that this procedure
won't work with rewinding the same fd. After an exec, that fd may point to
a reaped former leader and yield no results. (Looking at the code now, it
looks like readdir will fail with the unusual error ENOENT in that case, so
userland could detect that case easily now.) To pick up the next iteration
of that procedure correctly, you'd have to reopen /proc/TGID/task by name
to get an fd associated with the new leader. That is the only thing I can
think of that is meaningful in userland terms and might be what you mean by
"destroying the old and forming the new". Is that it?
But it also sounds like you may be saying that the lack of locking in
proc_task_readdir means it could just punt out of its listing loop early at
any time that the task it just looked at is reaped. Is that right? That
is OK for userland if any given readdir call returns a short list--but not
if it returns a premature EOF. It looks to me like that's possible too.
If so, that is startling off hand, and breaks the userland expectation by
the "least astonishment" principle. (That is, you can sometimes get a
short listing showing a subset of threads that does not include some
threads you previously saw as alive and are still alive.) It can also
actually break the procedure I described above if one false EOF causes the
reader to miss a new thread it hasn't seen before, so it thinks it has
already stopped all the threads that are alive.
I don't know of anything in userland using that procedure. But it's
what I would have recommended if asked, before you brought this issue
up. (strace -p does a single pass and is only intending to offer any
guarantee if you've already finished stopping the thing with SIGSTOP
first. gdb uses other means that amount to setting a breakpoint inside
pthread_create before reading libpthread's data structures from user
memory for the userland thread list without regard to the actual kernel
thread situation.)
I suppose we can just say that proc_task_readdir is entirely unreliable
unless you're sure otherwise that threads are not being reaped while you
read it, since that seems to be the truth of it. I would sure be
happier as a userland programmer if the kernel gave something that I
could rely on by some feasible race-noticing procedure akin to that
above, but it's not the end of the world.
Thanks,
Roland
> That we don't cleanup that zombie leaders is unfortunate really, it
> means we have the entire de_thread special case. But short fixing
> libpthread to not make bad assumptions there is little we can do about
> it really.
To be fair (perish the thought!), the semantics for these things were
clearly specified in POSIX-1996 years before we had made any attempt in
Linux at implementing them in the current fashion or any other, so we have
no one but ourselves to blame for our current implementation choices.
There are no required semantics about "zombie leaders". The semantics are
that wait* calls refer to the whole process (thread group) and those wait
events don't occur until all threads (the initial thread or others) are
dead (or all stopped).
Nothing really says we have to keep the leader's task_struct around, we
just need to keep the tgid assigned. We could have the signal_struct be
the thing that holds the tgid and is on the children list that wait looks
at, and let dead leaders "self-reap" like other threads do.
Since we've already separated the ptrace list and pseudo-wait from the
proper children/wait list, it's only some of these existing implementation
assumptions and perhaps some /proc vagaries that prevent us doing that.
> I'm only half following this conversation.
Going for a third here.
Thanks,
Roland
> Hmmm... The kerneldoc for rcu_read_lock() looks like it says this
> pretty unambiguously. Then again, that is easy for me to say, given
> that I was the one who made most of this stuff up. ;-)
Yeah, well, I thought I knew what rcu_read_lock did and the only new thing
I had in mind was using call_rcu, so that's the only place I looked. It
wasn't entirely unreasonable to have to read through Documentation/RCU/ a
bit for a subtlety of this kind, and I did find it quickly enough once
Oleg's attentiveness alerted me to be freshly afraid of the subtleties.
> But it has been awhile since I looked after the kerneldoc. Please see
> below for the changes I would make if left to myself.
Those look good to me!
> Any other suggestions for improvement to the kerneldoc?
Oh, I don't know, send over someone to read it to me in a sultry voice
so I get appropriately motivated to cast about for all of it I can find
and listen closely to every word? It already looks like if I'd actually
read all of it twice that day, I would have had the information I needed.
Thanks,
Roland
On Thu, Jul 08, 2010 at 06:01:37PM -0700, Roland McGrath wrote:
> > Hmmm... The kerneldoc for rcu_read_lock() looks like it says this
> > pretty unambiguously. Then again, that is easy for me to say, given
> > that I was the one who made most of this stuff up. ;-)
>
> Yeah, well, I thought I knew what rcu_read_lock did and the only new thing
> I had in mind was using call_rcu, so that's the only place I looked. It
> wasn't entirely unreasonable to have to read through Documentation/RCU/ a
> bit for a subtlety of this kind, and I did find it quickly enough once
> Oleg's attentiveness alerted me to be freshly afraid of the subtleties.
>
> > But it has been awhile since I looked after the kerneldoc. Please see
> > below for the changes I would make if left to myself.
>
> Those look good to me!
Very good, I have them queued. I have some messy dependencies in my
tree, so it will take some time for them to appear.
> > Any other suggestions for improvement to the kerneldoc?
>
> Oh, I don't know, send over someone to read it to me in a sultry voice
> so I get appropriately motivated to cast about for all of it I can find
> and listen closely to every word? It already looks like if I'd actually
> read all of it twice that day, I would have had the information I needed.
Hmmm... Despite being happily married for well over two decades, I do
feel the need to advise you to be careful what you wish for. ;-)
Thanx, Paul