2009-03-26 02:01:42

by Oleg Nesterov

Subject: Re: [rfc] x86, bts: fix crash

Metzger, let's move this discussion to lkml, I also cc'ed Roland.
Imho, this problem (which I don't fully understand ;) needs more
eyes.

On 03/25, Metzger, Markus T wrote:
>
> The attached patch

Please don't use attachments ;)

> I would appreciate any comment or directions you have if you think that this is the
>
> wrong approach to tackle this problem.

Metzger, this is all black magic to me; I don't even know what bts does.
But at least some bits don't look right to me.

> +static void ptrace_bts_exit_tracer(void)
> {
> + struct task_struct *child, *n;
> +
> /*
> - * Ptrace_detach() races with ptrace_untrace() in case
> - * the child dies and is reaped by another thread.
> + * We must not hold the tasklist_lock when we release the bts
> + * tracer, since we need to make sure the cpu executing the
> + * tracee stops tracing before we may free the trace buffer.
> *
> - * We only do the memory accounting at this point and
> - * leave the buffer deallocation and the bts tracer
> - * release to ptrace_bts_untrace() which will be called
> - * later on with tasklist_lock held.
> + * read_lock(&tasklist_lock) and smp_call_function() may
> + * deadlock with write_lock_irq(&tasklist_lock).
> + *
> + * As long as we're the only one modifying our list of traced
> + * children, we should be safe, since we're currently busy.
> */
> - release_locked_buffer(child->bts_buffer, child->bts_size);
> + list_for_each_entry_safe(child, n, &current->ptraced, ptrace_entry) {

It is not safe to iterate over current->ptraced locklessly; the comment
is not right. Another subthread can reap the dead tracee, or at least
do ptrace_unlink() if the tracee is not a child of our thread group.

> +static void ptrace_bts_exit_tracee(void)
> +{
> + struct mm_struct *mm = NULL;
> +
> + /*
> + * This code may race with ptrace reparenting.
> + *
> + * We hold the tasklist lock while we check whether we're
> + * ptraced and grab our ptracer's mm.
> + *
> + * It does not really matter if we're reparented,
> + * afterwards. We hold onto the correct mm.
> + *
> + * If we're reparented before we get the tasklist_lock, our
> + * former ptrace parent will release the bts tracer.
> + */
> + write_lock_irq(&tasklist_lock);
> + if (current->ptrace)
> + mm = get_task_mm(current->parent);

We can't take task_lock() (called by get_task_mm) under write_lock(tasklist),
this is deadlockable. Afaics, read_lock() is enough here, in that case it is
safe to call get_task_mm().

> @@ -752,6 +752,14 @@ void ds_release_bts(struct bts_tracer *t
>
> ds_suspend_bts(tracer);
>
> + /*
> + * We must wait for the suspend to take effect before we may
> + * free the tracer and the ds configuration.
> + */
> + if (tracer->ds.context->task &&
> + (tracer->ds.context->task != current))
> + wait_task_inactive(tracer->ds.context->task, 0);

I am not sure I understand the problem. From the changelog:

If the children are currently executing, the buffer
may be freed while the hardware is still tracing.
This might cause the hardware to overwrite memory.

So, the problem is that ds.context->task must not be running before we
can start to disable/free ds, yes? Something like ds_switch_to() should
be completed, right?

In that case I don't really understand how wait_task_inactive() can help.
If the task is killed it can be scheduled again, right after
wait_task_inactive() returns.

Also. This function is called from ptrace_bts_exit_tracer(), when the
tracee is not stopped. In this case wait_task_inactive() can spin forever.
For example, if the tracee simply does "for (;;) ;" it never succeeds.


If my understanding of the problem is wrong, could you please explain
it for dummies?

Oleg.


2009-03-27 15:01:39

by Metzger, Markus T

Subject: RE: [rfc] x86, bts: fix crash

>-----Original Message-----
>From: Oleg Nesterov [mailto:[email protected]]
>Sent: Thursday, March 26, 2009 2:58 AM
>To: Metzger, Markus T


>Metzger, let's move this discussion to lkml, I also cc'ed Roland.
>Imho, this problem (which I don't fully understand ;) needs more
>eyes.

Oleg,

I just found your reply in my spam folder;-)


>> +static void ptrace_bts_exit_tracer(void)
>> {
>> + struct task_struct *child, *n;
>> +
>> /*
>> - * Ptrace_detach() races with ptrace_untrace() in case
>> - * the child dies and is reaped by another thread.
>> + * We must not hold the tasklist_lock when we release the bts
>> + * tracer, since we need to make sure the cpu executing the
>> + * tracee stops tracing before we may free the trace buffer.
>> *
>> - * We only do the memory accounting at this point and
>> - * leave the buffer deallocation and the bts tracer
>> - * release to ptrace_bts_untrace() which will be called
>> - * later on with tasklist_lock held.
>> + * read_lock(&tasklist_lock) and smp_call_function() may
>> + * deadlock with write_lock_irq(&tasklist_lock).
>> + *
>> + * As long as we're the only one modifying our list of traced
>> + * children, we should be safe, since we're currently busy.
>> */
>> - release_locked_buffer(child->bts_buffer, child->bts_size);
>> + list_for_each_entry_safe(child, n, &current->ptraced, ptrace_entry) {
>
>It is not safe to iterate over current->ptraced locklessly; the comment
>is not right. Another subthread can reap the dead tracee, or at least
>do ptrace_unlink() if the tracee is not a child of our thread group.


I meanwhile changed that to read_lock(&tasklist_lock) around the iteration.
I temporarily drop the lock in the loop body.
After retaking the lock, I check whether next is still child->next.
If not, I start over again.

Individual adds or removes should not be able to break this. We could break it
if we split the list, but I'm not aware of such an operation on the list of
ptraced children.


>> +static void ptrace_bts_exit_tracee(void)
>> +{
>> + struct mm_struct *mm = NULL;
>> +
>> + /*
>> + * This code may race with ptrace reparenting.
>> + *
>> + * We hold the tasklist lock while we check whether we're
>> + * ptraced and grab our ptracer's mm.
>> + *
>> + * It does not really matter if we're reparented,
>> + * afterwards. We hold onto the correct mm.
>> + *
>> + * If we're reparented before we get the tasklist_lock, our
>> + * former ptrace parent will release the bts tracer.
>> + */
>> + write_lock_irq(&tasklist_lock);
>> + if (current->ptrace)
>> + mm = get_task_mm(current->parent);
>
>We can't take task_lock() (called by get_task_mm) under write_lock(tasklist),
>this is deadlockable. Afaics, read_lock() is enough here, in that case it is
>safe to call get_task_mm().

Thanks. I changed that to read_lock.


>> @@ -752,6 +752,14 @@ void ds_release_bts(struct bts_tracer *t
>>
>> ds_suspend_bts(tracer);
>>
>> + /*
>> + * We must wait for the suspend to take effect before we may
>> + * free the tracer and the ds configuration.
>> + */
>> + if (tracer->ds.context->task &&
>> + (tracer->ds.context->task != current))
>> + wait_task_inactive(tracer->ds.context->task, 0);
>
>I am not sure I understand the problem. From the changelog:
>
> If the children are currently executing, the buffer
> may be freed while the hardware is still tracing.
> This might cause the hardware to overwrite memory.
>
>So, the problem is that ds.context->task must not be running before we
>can start to disable/free ds, yes? Something like ds_switch_to() should
>be completed, right?
>
>In that case I don't really understand how wait_task_inactive() can help.
>If the task is killed it can be scheduled again, right after
>wait_task_inactive() returns.

We first call ds_suspend_bts().
This clears the branch tracing control bits for the traced task and already
writes the updated value to the msr, if running on the current cpu.
If the task is running on a different cpu, the updated value will be written
when the task is scheduled out.
By waiting for the task to become inactive, we know that it has been scheduled out
at least once after we changed the bits. So we know that the hardware will not use
the tracing configuration for that task and we can safely free the memory.


>Also. This function is called from ptrace_bts_exit_tracer(), when the
>tracee is not stopped. In this case wait_task_inactive() can spin forever.
>For example, if the tracee simply does "for (;;) ;" it never succeeds.

As far as I understand, wait_task_inactive() returns when the task is scheduled out.
It just loops as long as the task is running.


regards,
markus.

---------------------------------------------------------------------
Intel GmbH
Dornacher Strasse 1
85622 Feldkirchen/Muenchen Germany
Sitz der Gesellschaft: Feldkirchen bei Muenchen
Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
Registergericht: Muenchen HRB 47456 Ust.-IdNr.
VAT Registration No.: DE129385895
Citibank Frankfurt (BLZ 502 109 00) 600119052

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

2009-03-27 16:54:20

by Oleg Nesterov

Subject: Re: [rfc] x86, bts: fix crash

On 03/27, Metzger, Markus T wrote:
>
> >> @@ -752,6 +752,14 @@ void ds_release_bts(struct bts_tracer *t
> >>
> >> ds_suspend_bts(tracer);
> >>
> >> + /*
> >> + * We must wait for the suspend to take effect before we may
> >> + * free the tracer and the ds configuration.
> >> + */
> >> + if (tracer->ds.context->task &&
> >> + (tracer->ds.context->task != current))
> >> + wait_task_inactive(tracer->ds.context->task, 0);
> >
> >I am not sure I understand the problem. From the changelog:
> >
> > If the children are currently executing, the buffer
> > may be freed while the hardware is still tracing.
> > This might cause the hardware to overwrite memory.
> >
> >So, the problem is that ds.context->task must not be running before we
> >can start to disable/free ds, yes? Something like ds_switch_to() should
> >be completed, right?
> >
> >In that case I don't really understand how wait_task_inactive() can help.
> >If the task is killed it can be scheduled again, right after
> >wait_task_inactive() returns.
>
> We first call ds_suspend_bts().
> This clears the branch tracing control bits for the traced task and already
> writes the updated value to the msr, if running on the current cpu.
> If the task is running on a different cpu, the updated value will be written
> when the task is scheduled out.
> By waiting for the task to become inactive, we know that it has been scheduled out
> at least once after we changed the bits. So we know that the hardware will not use
> the tracing configuration for that task and we can safely free the memory.

Still can't understand...

Let's suppose the traced task is scheduled again, right after
wait_task_inactive() returns and before we set ds.context->bts_master = NULL.

In this case, can't ds_switch_to() (which plays with ds_context) race
with ds_put_context()->kfree(context) ?

> >Also. This function is called from ptrace_bts_exit_tracer(), when the
> >tracee is not stopped. In this case wait_task_inactive() can spin forever.
> >For example, if the tracee simply does "for (;;) ;" it never succeeds.
>
> As far as I understand, wait_task_inactive() returns when the task is scheduled out.

Yes. But the task doing the above is never scheduled out; it is always running
even if preempted by another task. wait_task_inactive() returns when
->on_rq == 0, iow when the task sleeps.

This means that the tracer can hang "forever" during exit, until the tracee
does the blocking syscall or exits.

This is not acceptable, imho.

Oleg.

2009-03-27 17:33:46

by Markus Metzger

Subject: Re: [rfc] x86, bts: fix crash

On Fri, 2009-03-27 at 17:50 +0100, Oleg Nesterov wrote:
> On 03/27, Metzger, Markus T wrote:
> >
> > >> @@ -752,6 +752,14 @@ void ds_release_bts(struct bts_tracer *t
> > >>
> > >> ds_suspend_bts(tracer);
> > >>
> > >> + /*
> > >> + * We must wait for the suspend to take effect before we may
> > >> + * free the tracer and the ds configuration.
> > >> + */
> > >> + if (tracer->ds.context->task &&
> > >> + (tracer->ds.context->task != current))
> > >> + wait_task_inactive(tracer->ds.context->task, 0);
> > >
> > >I am not sure I understand the problem. From the changelog:
> > >
> > > If the children are currently executing, the buffer
> > > may be freed while the hardware is still tracing.
> > > This might cause the hardware to overwrite memory.
> > >
> > >So, the problem is that ds.context->task must not be running before we
> > >can start to disable/free ds, yes? Something like ds_switch_to() should
> > >be completed, right?
> > >
> > >In that case I don't really understand how wait_task_inactive() can help.
> > >If the task is killed it can be scheduled again, right after
> > >wait_task_inactive() returns.
> >
> > We first call ds_suspend_bts().
> > This clears the branch tracing control bits for the traced task and already
> > writes the updated value to the msr, if running on the current cpu.
> > If the task is running on a different cpu, the updated value will be written
> > when the task is scheduled out.
> > By waiting for the task to become inactive, we know that it has been scheduled out
> > at least once after we changed the bits. So we know that the hardware will not use
> > the tracing configuration for that task and we can safely free the memory.
>
> Still can't understand...
>
> Let's suppose the traced task is scheduled again, right after
> wait_task_inactive() returns and before we set ds.context->bts_master = NULL.
>
> In this case, can't ds_switch_to() (which plays with ds_context) race
> with ds_put_context()->kfree(context) ?


You're right. Now I see the problem.

The bts_master pointer needs to be set to NULL while the traced task is
off the cpu.

The kfree will only happen if the context is not used. If another tracer
immediately reuses the context and traces the same task, it will
become the new master and overwrite the configuration.

I assumed that the traced task would always be stopped, since that's
the ptrace scenario. But it's not true for disable, which may be called
by an exiting tracer while the tracee is running.

I would still require it for ds_request_bts_task(), though. I will add
a comment regarding this.

Regarding the race on task->thread.ds_ctx between ds_release_bts() and
ds_switch_to(), how would I prevent a task from being rescheduled for
a small amount of time?


> > >Also. This function is called from ptrace_bts_exit_tracer(), when the
> > >tracee is not stopped. In this case wait_task_inactive() can spin forever.
> > >For example, if the tracee simply does "for (;;) ;" it never succeeds.
> >
> > As far as I understand, wait_task_inactive() returns when the task is scheduled out.
>
> Yes. But the task doing the above is never scheduled out; it is always running
> even if preempted by another task. wait_task_inactive() returns when
> ->on_rq == 0, iow when the task sleeps.

Oops.


> This means that the tracer can hang "forever" during exit, until the tracee
> does the blocking syscall or exits.
>
> This is not acceptable, imho.

Definitely not.


What I would need here (see above) is to get the traced task off any cpu
in order to clear the bts_master pointer.
Is there some existing function that I could use?
Could you point me to some code that does something similar?

thanks,
markus.

2009-03-27 21:33:30

by Oleg Nesterov

Subject: Re: [rfc] x86, bts: fix crash

On 03/27, Markus Metzger wrote:
>
> Regarding the race on task->thread.ds_ctx between ds_release_bts() and
> ds_switch_to(), how would I prevent a task from being rescheduled for
> a small amount of time?

I don't see how we can do this. We can split wait_task_inactive() into
2 functions, the first one returns with task_rq_lock() held and interrupts
disabled. But this is nasty, and in any case wait_task_inactive(p) can't
force "p" to be deactivated.

Can't we do something different?

For simplicity, let's suppose that we have only task_struct->bts and it
is just a blob of memory which can be used by CPU somehow.

First, we add "struct rcu_head" into task_struct->bts, and then

	void free_bts(struct rcu_head *rcu)
	{
		struct bts_tracer *bts = container_of();
		...
		kfree(bts);
	}

	void ds_release_bts(struct bts_tracer *tracer)
	{
		struct task_struct *child = tracer->ds.context->task;
		struct bts_tracer *bts = child->bts;

		child->bts = NULL;

		// make sure child will NOT use ->bts
		// after the next context switch,
		// clear TIF_DS_AREA_MSR or something
		...

		call_rcu_sched(&bts->rcu, free_bts);
	}

Now we can call ds_release_bts() from the atomic context (as we do
now).

Once again, the pseudo code above has nothing to do with reality,
just for illustration.

(as for the memory accounting, this is another issue, let's forget
for now).

Oleg.

2009-03-30 07:25:07

by Metzger, Markus T

Subject: RE: [rfc] x86, bts: fix crash

>-----Original Message-----
>From: Oleg Nesterov [mailto:[email protected]]
>Sent: Friday, March 27, 2009 10:30 PM
>To: Markus Metzger


>> Regarding the race on task->thread.ds_ctx between ds_release_bts() and
>> ds_switch_to(), how would I prevent a task from being rescheduled for
>> a small amount of time?
>
>I don't see how we can do this. We can split wait_task_inactive() into
>2 functions, the first one returns with task_rq_lock() held and interrupts
>disabled. But this is nasty, and in any case wait_task_inactive(p) can't
>force "p" to be deactivated.
>
>Can't we do something different?
>
>For simplicity, let's suppose that we have only task_struct->bts and it
>is just a blob of memory which can be used by CPU somehow.
>
>First, we add "struct rcu_head" into task_struct->bts, and then
>
> 	void free_bts(struct rcu_head *rcu)
> {
> struct bts_tracer *bts = container_of();
> ...
> kfree(bts);
> }
>
> void ds_release_bts(struct bts_tracer *tracer)
> {
> struct task_struct *child = tracer->ds.context->task;
> struct bts_tracer *bts = child->bts;
>
> child->bts = NULL;
>
> // make sure child will NOT use ->bts
> // after the next context switch,
> // clear TIF_DS_AREA_MSR or something
> ...

But that's exactly the problem if the bts_tracer is released by another
task (!= current).
We first need to disable tracing by clearing bits in debugctlmsr.
Then, we can set ->bts to NULL and clear TIF_DS_AREA_MSR.

If the traced task is currently in a context switch, clearing debugctlmsr bits
might come too late. We need to wait for the next context switch until we can
clear TIF_DS_AREA_MSR.


The benefit would be that I don't need to hook into do_exit() anymore.
This would rid us of the nasty ->ptraced loop.
I will give it a try.


I use something like this to wait for the context switch:
	nvcsw  = task->nvcsw  + 1;
	nivcsw = task->nivcsw + 1;
	for (;;) {
		if (nvcsw < task->nvcsw)
			break;
		if (nivcsw < task->nivcsw)
			break;
		if (task->state != TASK_RUNNING)
			break;
	}

I would need to check for overflows, but for my examples, I don't expect any.


regards,
markus.


2009-03-30 11:30:11

by Metzger, Markus T

Subject: RE: [rfc] x86, bts: fix crash

>-----Original Message-----
>From: Metzger, Markus T
>Sent: Monday, March 30, 2009 9:24 AM
>To: Oleg Nesterov; Markus Metzger
>Cc: Kleen, Andi; Ingo Molnar; Roland McGrath; [email protected]


>But that's exactly the problem if the bts_tracer is released by another
>task (!= current).
>We first need to disable tracing by clearing bits in debugctlmsr.
>Then, we can set ->bts to NULL and clear TIF_DS_AREA_MSR.
>
>If the traced task is currently in a context switch, clearing debugctlmsr bits
>might come too late. We need to wait for the next context switch until we can
>clear TIF_DS_AREA_MSR.
>
>
>The benefit would be that I don't need to hook into do_exit() anymore.
>This would rid us of the nasty ->ptraced loop.
>I will give it a try.
>
>
>I use something like this to wait for the context switch:
> nvcsw = task->nvcsw + 1;
> nivcsw = task->nivcsw + 1;
> for (;;) {
> if (nvcsw < task->nvcsw)
> break;
> if (nivcsw < task->nivcsw)
> break;
> if (task->state != TASK_RUNNING)
> break;
> }
>
>I would need to check for overflows, but for my examples, I don't expect any.


That is not quite right either. There's a race on the task state.
In my example, I got TASK_DEAD before the dying task could complete its
final schedule(), and the cpu continued tracing.

When I use the same task_running() test that wait_task_inactive() uses,
it seems to work. Together with checking the number of context switches,
it should be reasonably fast as well (I don't run the risk of being scheduled
perfectly synchronously with the task I'm waiting for).

I would need to add an extern function task_is_running(struct task_struct *)
to sched.h, though. I hope that's acceptable.


regards,
markus.

2009-03-30 13:33:00

by Oleg Nesterov

Subject: Re: [rfc] x86, bts: fix crash

On 03/30, Metzger, Markus T wrote:
>
> >The benefit would be that I don't need to hook into do_exit() anymore.

Metzger, I got lost ;) And I didn't sleep today, so most probably I missed
what you mean...

do you mean the helper below will be called under write_lock_irq(tasklist)?
In that case,

> >This would rid us of the nasty ->ptraced loop.
> >I will give it a try.
> >
> >
> >I use something like this to wait for the context switch:
> > nvcsw = task->nvcsw + 1;
> > nivcsw = task->nivcsw + 1;
> > for (;;) {
> > if (nvcsw < task->nvcsw)
> > break;
> > if (nivcsw < task->nivcsw)
> > break;

Not exactly right, schedule() increments nvcsw/nivcsw before context_switch().
But this is fixable.

However. What if this task spins in TASK_RUNNING waiting for tasklist_lock?
This is deadlockable even with CONFIG_PREEMPT; we take tasklist_lock for reading
in interrupt context.

Afaics, we can also deadlock if task_cpu(task) sends IPI to us (with wait = 1),
the sender spins with preemption disabled.

> > if (task->state != TASK_RUNNING)
> > break;
> > }
> >
>
> That is not quite right, as well. There's a race on the task state.
> In my example, I got TASK_DEAD before the dying task could complete its
> final schedule(), and the cpu continued tracing.

But we still have the same problems.

If the tracee doesn't call a blocking syscall, its ->state is always RUNNING.

It could be woken right after we see !TASK_RUNNING or !task_running().

Oleg.

2009-03-30 13:57:24

by Metzger, Markus T

Subject: RE: [rfc] x86, bts: fix crash

>-----Original Message-----
>From: Oleg Nesterov [mailto:[email protected]]
>Sent: Monday, March 30, 2009 3:29 PM
>To: Metzger, Markus T


>> >The benefit would be that I don't need to hook into do_exit() anymore.
>
>Metzger, I got lost ;) And I didn't sleep today, so most probably I missed
>what you mean...

The way I understood you, I should defer the release of the bts tracer.
I can schedule the work in __ptrace_unlink(), so I don't need the changes
that hook into do_exit().
The work will be done at a later time with interrupts enabled.

I'm looking into schedule_work() right now, since I don't need all the other
features of RCU.


>do you mean the helper below will be called under write_lock_irq(tasklist)?
>In that case,
>
>> >This would rid us of the nasty ->ptraced loop.
>> >I will give it a try.
>> >
>> >
>> >I use something like this to wait for the context switch:
>> > nvcsw = task->nvcsw + 1;
>> > nivcsw = task->nivcsw + 1;
>> > for (;;) {
>> > if (nvcsw < task->nvcsw)
>> > break;
>> > if (nivcsw < task->nivcsw)
>> > break;
>
>Not exactly right, schedule() increments nvcsw/nivcsw before context_switch().
>But this is fixable.

That's why I added +1. There's still the overflow problem. I now use

	nvcsw = task->nvcsw;
	for (;;) {
		if ((task->nvcsw - nvcsw) > 1)
			break;
		...
		if (!task_is_running(task))
			break;
		schedule();
	}

This should work even for overflowing counters. It waits for two
context switches, or - preferably - for the task to be currently not running
on any cpu (see below for task_is_running()).


>However. What if this task spins in TASK_RUNNING waiting for tasklist_lock ?
>This is deadlockable even with CONFIG_PREEMPT, we take tasklit for reading
>in interrupt context.

That code is executed with interrupts enabled and tasklist lock not held.
That's why I added the ptrace_bts_exit_tracer() and ptrace_bts_exit_tracee()
calls - to be able to call ds_release_bts() with interrupts enabled.


>Afaics, we can also deadlock if task_cpu(task) sends IPI to us (with wait = 1),
>the sender spins with preemption disabled.
>
>> > if (task->state != TASK_RUNNING)
>> > break;
>> > }
>> >
>>
>> That is not quite right, as well. There's a race on the task state.
>> In my example, I got TASK_DEAD before the dying task could complete its
>> final schedule(), and the cpu continued tracing.
>
>But we still have the same problems.
>
>If the tracee doesn't call a blocking syscall, its ->state is always RUNNING.

Agreed.

I meanwhile added a function task_is_running() to sched.h that checks whether
the parameter task is currently running on any cpu.
I use that instead of checking ->state.

The function is essentially:
	int task_is_running(struct task_struct *p)
	{
		struct rq *rq;
		unsigned long flags;
		int running;

		rq = task_rq_lock(p, &flags);
		running = task_running(rq, p);
		task_rq_unlock(rq, &flags);

		return running;
	}


thanks and regards,
markus.
