2005-05-12 14:44:03

by Daniel Walker

Subject: RT and Cascade interrupts


It seems like the cascade interrupts could run in threads, but
i386 doesn't, and I know ARM crashed with cascades in threads. You may have a
bit of a slowdown, but it seems possible. Does anyone have some reasoning
for why we aren't running the cascades in threads?

Daniel



2005-05-13 07:44:53

by Ingo Molnar

Subject: Re: RT and Cascade interrupts


* Daniel Walker <[email protected]> wrote:

> It seems like the cascade interrupts could run in threads, but
> i386 doesn't, and I know ARM crashed with cascades in threads. You may
> have a bit of a slowdown, but it seems possible. Does anyone have
> some reasoning for why we aren't running the cascades in threads?

are the x86 cascade interrupts real ones in fact? Normally they should
never trigger directly. (except on ARM which has a completely different
notion of cascade irq)

Ingo

2005-05-13 13:13:38

by john cooper

Subject: Re: RT and Cascade interrupts

Ingo Molnar wrote:
> * Daniel Walker <[email protected]> wrote:
>
>
>> It seems like the cascade interrupts could run in threads, but
>>i386 doesn't, and I know ARM crashed with cascades in threads. You may
>>have a bit of a slowdown, but it seems possible. Does anyone have
>>some reasoning for why we aren't running the cascades in threads?
>
>
> are the x86 cascade interrupts real ones in fact? Normally they should
> never trigger directly. (except on ARM which has a completely different
> notion of cascade irq)

I just caught this thread in the corner of my eye.

I'm seeing the BUG assert in kernel/timer.c:cascade()
kick in (tmp->base is somehow 0) during a test which
creates a few tasks of priority higher than ksoftirqd.
This race doesn't happen if ksoftirqd's priority is
elevated (eg: chrt -f -p 75 2) so the -RT patch might
be opening up a window here.

I'm in the process of investigating this and see a few
potential suspects but was wondering if anyone else has
seen this behavior?

-john

--
[email protected]

2005-05-24 16:40:33

by john cooper

Subject: Re: RT and Cascade interrupts

./include/linux/sunrpc/sched.h
./net/sunrpc/sched.c
=================================================================
--- ./include/linux/sunrpc/sched.h.ORG 2005-05-24 10:29:24.000000000 -0400
+++ ./include/linux/sunrpc/sched.h 2005-05-24 10:47:56.000000000 -0400
@@ -142,7 +142,6 @@ typedef void (*rpc_action)(struct rpc_
#define RPC_TASK_RUNNING 0
#define RPC_TASK_QUEUED 1
#define RPC_TASK_WAKEUP 2
-#define RPC_TASK_HAS_TIMER 3

#define RPC_IS_RUNNING(t) (test_bit(RPC_TASK_RUNNING, &(t)->tk_runstate))
#define rpc_set_running(t) (set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate))
=================================================================
--- ./net/sunrpc/sched.c.ORG 2005-05-24 10:29:52.000000000 -0400
+++ ./net/sunrpc/sched.c 2005-05-24 11:02:44.000000000 -0400
@@ -103,9 +103,6 @@ static void rpc_run_timer(struct rpc_tas
dprintk("RPC: %4d running timer\n", task->tk_pid);
callback(task);
}
- smp_mb__before_clear_bit();
- clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate);
- smp_mb__after_clear_bit();
}

/*
@@ -124,7 +121,6 @@ __rpc_add_timer(struct rpc_task *task, r
task->tk_timeout_fn = timer;
else
task->tk_timeout_fn = __rpc_default_timer;
- set_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate);
mod_timer(&task->tk_timer, jiffies + task->tk_timeout);
}

@@ -135,7 +131,7 @@ __rpc_add_timer(struct rpc_task *task, r
static inline void
rpc_delete_timer(struct rpc_task *task)
{
- if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
+ if (task->tk_timer.base) {
del_singleshot_timer_sync(&task->tk_timer);
dprintk("RPC: %4d deleting timer\n", task->tk_pid);
}


Attachments:
RPC.patch (1.59 kB)

2005-05-27 07:37:11

by Ingo Molnar

Subject: Re: RT and Cascade interrupts


* john cooper <[email protected]> wrote:

> john cooper wrote:
> >I'm seeing the BUG assert in kernel/timer.c:cascade()
> >kick in (tmp->base is somehow 0) during a test which
> >creates a few tasks of priority higher than ksoftirqd.
> >This race doesn't happen if ksoftirqd's priority is
> >elevated (eg: chrt -f -p 75 2) so the -RT patch might
> >be opening up a window here.
>
> There is a window in rpc_run_timer() which allows
> it to lose track of timer ownership when ksoftirqd
> (and thus itself) are preempted. This doesn't
> immediately cause a problem but does corrupt
> the timer cascade list when the timer struct is
> recycled/requeued. This shows up some time later
> as the list is processed. The failure mode is cascade()
> attempting to percolate a timer with poisoned
> next/prev *s and a NULL base causing the assertion
> BUG(tmp->base != base) to kick in.
>
> The RPC code is attempting to replicate state of
> timer ownership for a given rpc_task via RPC_TASK_HAS_TIMER
> in rpc_task.tk_runstate. Besides not working
> correctly in the case of preemptable context it is
> a replication of state of a timer pending in the
> cascade structure (ie: timer->base). The fix
> changes the RPC code to use timer->base when
> deciding whether an outstanding timer registration
> exists during rpc_task tear down.
>
> Note: this failure occurred in the 40-04 version of
> the patch though it applies to more current versions.
> It was seen when executing stress tests on a number
> of PPC targets running on an NFS mounted root though
> was not observed on a x86 target under similar
> conditions.

should this fix go upstream too?

Ingo

2005-05-27 13:56:00

by john cooper

Subject: Re: RT and Cascade interrupts

Ingo Molnar wrote:
> * john cooper <[email protected]> wrote:
>>The RPC code is attempting to replicate state of
>>timer ownership for a given rpc_task via RPC_TASK_HAS_TIMER
>>in rpc_task.tk_runstate. Besides not working
>>correctly in the case of preemptable context it is
>>a replication of state of a timer pending in the
>>cascade structure (ie: timer->base). The fix
>>changes the RPC code to use timer->base when
>>deciding whether an outstanding timer registration
>>exists during rpc_task tear down.
>>
>>Note: this failure occurred in the 40-04 version of
>>the patch though it applies to more current versions.
>>It was seen when executing stress tests on a number
>>of PPC targets running on an NFS mounted root though
>>was not observed on a x86 target under similar
>>conditions.
>
>
> should this fix go upstream too?

Yes. The RPC code is attempting to replicate existing
and easily accessible state information in a timer
structure. The simplistic means by which it does so
fails if ksoftirqd/rpc_run_timer() runs in preemptive
context.

-john


--
[email protected]

2005-05-27 16:39:20

by Oleg Nesterov

Subject: Re: RT and Cascade interrupts

john cooper wrote:
>
> rpc_delete_timer(struct rpc_task *task)
> {
> - if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
> + if (task->tk_timer.base) {
> del_singleshot_timer_sync(&task->tk_timer);
> dprintk("RPC: %4d deleting timer\n", task->tk_pid);
> }

I know nothing about rpc, but this looks completely wrong to me.

Please don't use the ->base directly, use timer_pending(). This
field is never NULL in -mm tree.

Next, timer_pending() == 0 does not mean that del_timer_sync() is
not needed, it may be running on the other CPU.

Looking at the code for the first time, I can guess that RPC_TASK_HAS_TIMER
was invented to indicate when it is safe not to wait for timer
completion. This bit is cleared after the ->tk_timeout_fn call in
the timer's handler. After that rpc_run_timer will not touch *task.

Oleg.

2005-05-27 23:39:10

by john cooper

Subject: Re: RT and Cascade interrupts

--- ./include/linux/sunrpc/sched.h.ORG 2005-05-24 10:29:24.000000000 -0400
+++ ./include/linux/sunrpc/sched.h 2005-05-24 10:47:56.000000000 -0400
@@ -142,7 +142,6 @@ typedef void (*rpc_action)(struct rpc_
#define RPC_TASK_RUNNING 0
#define RPC_TASK_QUEUED 1
#define RPC_TASK_WAKEUP 2
-#define RPC_TASK_HAS_TIMER 3

#define RPC_IS_RUNNING(t) (test_bit(RPC_TASK_RUNNING, &(t)->tk_runstate))
#define rpc_set_running(t) (set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate))
=================================================================
--- ./net/sunrpc/sched.c.ORG 2005-05-24 10:29:52.000000000 -0400
+++ ./net/sunrpc/sched.c 2005-05-27 18:27:41.000000000 -0400
@@ -103,9 +103,6 @@ static void rpc_run_timer(struct rpc_tas
dprintk("RPC: %4d running timer\n", task->tk_pid);
callback(task);
}
- smp_mb__before_clear_bit();
- clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate);
- smp_mb__after_clear_bit();
}

/*
@@ -124,7 +121,6 @@ __rpc_add_timer(struct rpc_task *task, r
task->tk_timeout_fn = timer;
else
task->tk_timeout_fn = __rpc_default_timer;
- set_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate);
mod_timer(&task->tk_timer, jiffies + task->tk_timeout);
}

@@ -135,7 +131,7 @@ __rpc_add_timer(struct rpc_task *task, r
static inline void
rpc_delete_timer(struct rpc_task *task)
{
- if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
+ if (timer_pending(&task->tk_timer)) {
del_singleshot_timer_sync(&task->tk_timer);
dprintk("RPC: %4d deleting timer\n", task->tk_pid);
}
@@ -849,6 +845,7 @@ void rpc_release_task(struct rpc_task *t
task->tk_active = 0;

/* Synchronously delete any running timer */
+ task->tk_timeout = 0;
rpc_delete_timer(task);

/* Release resources */


Attachments:
RPC.patch2 (1.69 kB)

2005-05-28 08:44:15

by Oleg Nesterov

Subject: Re: RT and Cascade interrupts

john cooper wrote:
>
> Oleg Nesterov wrote:
> > john cooper wrote:
> >
> >> rpc_delete_timer(struct rpc_task *task)
> >> {
> >>- if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
> >>+ if (task->tk_timer.base) {
> >> del_singleshot_timer_sync(&task->tk_timer);
> >> dprintk("RPC: %4d deleting timer\n", task->tk_pid);
> >> }
> >
> >
> > I know nothing about rpc, but this looks completely wrong to me.
>
> > Next, timer_pending() == 0 does not mean that del_timer_sync() is
> > not needed, it may be running on the other CPU.
>
> Whether timer_pending() is SMP safe depends on the caller's
> context relative to the overall usage of the timer structure.
>
> For example we aren't holding base->lock in rpc_delete_timer()
> as would normally be expected. The reason this is safe is
> the transition to "timer off base" and "timer.base <- NULL"
> follows this sequence in __run_timers(). So the worst we
> can do is to be racing with another cpu trying to expire this
> timer and we will delete an already inactive timer which is
> innocuous here.

CPU_0 CPU_1

rpc_release_task(task)
__run_timers:
timer->base = NULL;
rpc_delete_timer()
if (timer->base)
// not taken
del_timer_sync();

mempool_free(task);
timer->function(task):
// task already freed/reused
__rpc_default_timer(task);


This is totally pointless to do:

if (timer_pending())
del_singleshot_timer_sync();

If you need to ensure that timer's handler is not running on any
cpu then timer_pending() can't help. If you don't need this, you
should use plain del_timer().

> > Looking at the code for the first time, I can guess that RPC_TASK_HAS_TIMER
> > was invented to indicate when it is safe not to wait for timer
> > completition. This bit is cleared after ->tk_timeout_fn call in
> > the timer's handler. After that rpc_run_timer will not touch *task.
>
> The problem being it is possible in the now preemptible
> context in which rpc_run_timer() executes for the call
> of callback(task) to be preempted allowing another task
> restart the timer (via a call to __rpc_add_timer()).
> So the previous implied assumption in a nonpreemptive
> rpc_run_timer() of the timer known to be inactive (as
> that's how we arrived here) is no longer reliable.

I don't understand why this race is rt specific. I think
that PREEMPT_SOFTIRQS just enlarges the window. I believe
the right fix is just to call del_singleshot_timer_sync()
unconditionally and kill RPC_TASK_HAS_TIMER bit.

I have added Olaf Kirch to the CC: list.

> The fix removes the attempt to replicate timer queue
> status from the RPC code by removing the usage of the
> pre-calculated RPC_TASK_HAS_TIMER.

No, RPC_TASK_HAS_TIMER doesn't replicate timer queue status.
I believe it was intended as an optimization, to avoid a costly
del_timer_sync() call.

Oleg.

2005-05-28 14:04:16

by john cooper

Subject: Re: RT and Cascade interrupts

Oleg Nesterov wrote:

[race scenario deleted]

This was not the race scenario of concern.

> This is totally pointless to do:
>
> if (timer_pending())
> del_singleshot_timer_sync();

However this is the action the RPC code was attempting
to take originally based upon stale/incorrect data.
Again the failure mode was of the RPC_TASK_HAS_TIMER
bit not being set when the embedded timer actually was
still present in the timer cascade structure (eg:
timer->base != 0 or timer_pending()). This caused the
rpc_task to be recycled with the embedded timer struct
still linked in the timer cascade.

> If you need to ensure that timer's handler is not running on any
> cpu then timer_pending() can't help. If you don't need this, you
> should use plain del_timer().

That's not the goal of the timer_pending() usage here.
Rather we're at a point in rpc_release_task where we
want to tear down an rpc_task. The call to timer_pending()
determines if the embedded timer is still linked in the
timer cascade structure.

The embedded timer may be running on another CPU. We
may even check timer->base during this race when it is
non-NULL and wind up deleting a timer which the path
running on another CPU has already deleted. That is
allowable. We simply want to idle the timer.

> I don't understand why this race is rt specific. I think
> that PREEMPT_SOFTIRQS just enlarges the window.

That is certainly possible. However this code has been
around for some time and this problem apparently hasn't
otherwise surfaced. I suspect it depends on the
combination of timer cascade structures existing
per-CPU and the execution context of callback(task)
invoked in rpc_run_timer() now being preemptive (ie:
preemption is required on the same CPU for this
problem to surface).

> I believe
> the right fix is just to call del_singleshot_timer_sync()
> unconditionally and kill RPC_TASK_HAS_TIMER bit.

Ignoring RPC_TASK_HAS_TIMER and always calling
del_singleshot_timer_sync() will work but seems
excessive.

> I have added Olaf Kirch to the CC: list.

Welcome.

>>The fix removes the attempt to replicate timer queue
>>status from the RPC code by removing the usage of the
>>pre-calculated RPC_TASK_HAS_TIMER.
>
>
> No, RPC_TASK_HAS_TIMER doesn't replicate timer queue status.
> I believe it was intended as an optimization, to avoid a costly
> del_timer_sync() call.

If true, its chosen name seems a misnomer. In any case
the embedded timer in the rpc_task structure must be
quiesced before the rpc_task is made available for
reuse. The fix under discussion here does so. Note the
goal was not to rearchitect the RPC code but rather to
address this very specific bug which surfaced during
stress testing.

If anyone with more ownership of the RPC code would like
to comment, any insight would be most welcome.

-john


--
[email protected]

2005-05-28 16:26:03

by Oleg Nesterov

Subject: Re: RT and Cascade interrupts

john cooper wrote:
>
> Oleg Nesterov wrote:
>
> > If you need to ensure that timer's handler is not running on any
> > cpu then timer_pending() can't help. If you don't need this, you
> > should use plain del_timer().
>
> That's not the goal of the timer_pending() usage here.
> Rather we're at a point in rpc_release_task where we
> want to tear down an rpc_task. The call to timer_pending()
> determines if the embedded timer is still linked in the
> timer cascade structure.

Yes, I see what you are trying to fix. However, your fix
opens even worse bug.

> If anyone with more ownership of the RPC code would like
> to comment, any insight would be most welcome.

Yes. Trond, could you please look at this thread:
http://marc.theaimsgroup.com/?t=111590936700001 and put an
end to our discussion?

Oleg.

2005-05-28 17:49:44

by john cooper

Subject: Re: RT and Cascade interrupts

Oleg Nesterov wrote:
>
> CPU_0 CPU_1
>
> rpc_release_task(task)
> __run_timers:
> timer->base = NULL;
> rpc_delete_timer()
> if (timer->base)
> // not taken
> del_timer_sync();
>
> mempool_free(task);
> timer->function(task):
> // task already freed/reused
> __rpc_default_timer(task);

Ah ok, I was misreading the above. Now I see your point.
The race, if hit, loses the synchronization provided by
del_timer_sync() in my case. Thanks for being persistent.

The other possibility to fix the original problem within
the scope of the RPC code was to replace the bit state of
RPC_TASK_HAS_TIMER with a count so we don't obscure the
fact the timer was requeued during the preemption window.
This happens as rpc_run_timer() does an unconditional
clear_bit(RPC_TASK_HAS_TIMER,..) before returning.

Another would be to check timer->base/timer_pending()
in rpc_run_timer() and to omit doing a clear_bit()
if the timer is found to have been requeued.

The former has simpler locking requirements but requires
an rpc_task data structure modification. The latter does
not and is a bit more intuitive but needs additional
locking. Both appear to provide a solution though I
haven't yet attempted either.

Anyone with more ownership of this code care to comment
on a preferred fix?

-john


--
[email protected]

2005-05-28 20:36:15

by Trond Myklebust

Subject: Re: RT and Cascade interrupts

On Sat 28.05.2005 at 13:48 (-0400), john cooper wrote:

> The other possibility to fix the original problem within
> the scope of the RPC code was to replace the bit state of
> RPC_TASK_HAS_TIMER with a count so we don't obscure the
> fact the timer was requeued during the preemption window.
> This happens as rpc_run_timer() does an unconditional
> clear_bit(RPC_TASK_HAS_TIMER,..) before returning.
>
> Another would be to check timer->base/timer_pending()
> in rpc_run_timer() and to omit doing a clear_bit()
> if the timer is found to have been requeued.

Could you please explain why you think such a scenario is possible? The
timer functions themselves should never be causing a re-queue, and every
iteration through the loop in __rpc_execute() should cause any pending
timer to be killed, as should rpc_release_task().

That's why we can use del_singleshot_timer_sync() in the first place:
because there is no recursion, and no re-queueing that can cause races.
I don't see how either preemption or RT will change that (and if they
do, then _that_ is the real bug that needs fixing).

Cheers,
Trond

2005-05-28 22:17:51

by Trond Myklebust

Subject: Re: RT and Cascade interrupts

On Sat 28.05.2005 at 13:48 (-0400), john cooper wrote:
> The other possibility to fix the original problem within
> the scope of the RPC code was to replace the bit state of
> RPC_TASK_HAS_TIMER with a count so we don't obscure the
> fact the timer was requeued during the preemption window.
> This happens as rpc_run_timer() does an unconditional
> clear_bit(RPC_TASK_HAS_TIMER,..) before returning.

Just to clarify a bit more here. RPC_TASK_HAS_TIMER is definitely _not_
redundantly replicating timer state, as I saw you asserting previously.

There are two distinct cases where timer->base may be NULL. In the first
case, the timer is inactive and was never queued so it is entirely
pointless to be calling del_timer() & friends. In the second case, the
timer was active, but has expired and was picked up by __run_timers().
del_singleshot_timer_sync() fails to distinguish these two cases, and
will therefore end up redundantly calling del_timer_sync() every time a
task loops through __rpc_execute() without setting a timer. Your patch
to remove RPC_TASK_HAS_TIMER will not only re-introduce that redundant
behaviour but will also re-introduce precisely the race that
del_timer_sync() is designed to fix.

So, I repeat: I want to see a case description that clearly demonstrates
a race scenario in the existing code before I'm willing to consider any
patches at all.

Cheers,
Trond

2005-05-29 03:14:25

by john cooper

Subject: Re: RT and Cascade interrupts

Trond Myklebust wrote:
> On Sat 28.05.2005 at 13:48 (-0400), john cooper wrote:
> Could you please explain why you think such a scenario is possible? The
> timer functions themselves should never be causing a re-queue, and every
> iteration through the loop in __rpc_execute() should cause any pending
> timer to be killed, as should rpc_release_task().

I'm trying to pinpoint the cause of the RPC timer cascade
structure corruption. Here is the data I've ascertained thus far:

1. the failure only has been seen for the timer struct embedded
in an rpc_task. The callback function is always rpc_run_timer().

2. the failure mode is an rpc_task's embedded timer struct being
queued a second time in the timer cascade without having
been dequeued from an existing cascade vector. The failure
is not immediate but causes a corrupt entry which is later
encountered in timer cascade code.

3. the problem occurs if rpc_run_timer() executes in preemptable
context. The problem was not encountered when preemption was
explicitly disabled within rpc_run_timer().

4. the problem appears related to RPC_TASK_HAS_TIMER not being set
in rpc_release_task(). Specifically the problem arises when
!RPC_TASK_HAS_TIMER but timer->base is non-zero. Modifying
rpc_delete_timer() to effect the del_singleshot_timer_sync()
call when (RPC_TASK_HAS_TIMER || timer->base) prevents the
failure.

5. (a detail of #2 above) Instrumenting the number of logical
add/mod/del operations on the timer struct shows the corruption
occurs at a point when the tally of the operations is:
(mod == add == del + 1), ie: the timer has been added/modified
several times but one count greater than the number of delete
operations (active in cascade). The next operation on this
timer struct is an add/mod operation with the state of the
timer having been reinitialized in rpc_init_task():init_timer()
with no intervening delete having been performed.

If it wasn't clear in my prior mail, please disregard the earlier
claim of RPC_TASK_HAS_TIMER replicating the state of timer->base.
From Oleg's earlier mail I see that isn't the case as additional
RPC-specific state is attached to this flag. The patch as well
should be disregarded.

> That's why we can use del_singleshot_timer_sync() in the first place:
> because there is no recursion, and no re-queueing that can cause races.
> I don't see how either preemption or RT will change that (and if they
> do, then _that_ is the real bug that needs fixing).

During the time I have been hunting this bug I've lost
count of the number of times I've alternatively suspected
either the kernel timer code or net/sunrpc/sched.c usage
of the same. I still feel it is somehow related to the RPC
code but will need to refine the instrumentation to extract
further information from the failure scenario.

A few questions I'd like to pose:

1. Can you correlate anything of rpc_run_timer() running in
preemptive context which could explain the above behavior?

2. Do you agree that (!RPC_TASK_HAS_TIMER && timer->base)
is an inconsistent state at the time of
rpc_release_task():rpc_delete_timer() ?


-john


--
[email protected]

2005-05-29 07:41:14

by Trond Myklebust

Subject: Re: RT and Cascade interrupts

sched.c | 8 ++++++++
1 files changed, 8 insertions(+)

Index: linux-2.6.12-rc4/net/sunrpc/sched.c
===================================================================
--- linux-2.6.12-rc4.orig/net/sunrpc/sched.c
+++ linux-2.6.12-rc4/net/sunrpc/sched.c
@@ -135,6 +135,8 @@ __rpc_add_timer(struct rpc_task *task, r
static void
rpc_delete_timer(struct rpc_task *task)
{
+ BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) == 0 &&
+ timer_pending(&task->tk_timer));
if (RPC_IS_QUEUED(task))
return;
if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
@@ -337,6 +339,8 @@ static void __rpc_sleep_on(struct rpc_wa
void rpc_sleep_on(struct rpc_wait_queue *q, struct rpc_task *task,
rpc_action action, rpc_action timer)
{
+ BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
+ timer_pending(&task->tk_timer));
/*
* Protect the queue operations.
*/
@@ -594,6 +598,8 @@ static int __rpc_execute(struct rpc_task
unlock_kernel();
}

+ BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
+ timer_pending(&task->tk_timer));
/*
* Perform the next FSM step.
* tk_action may be NULL when the task has been killed
@@ -925,6 +931,8 @@ fail:

void rpc_run_child(struct rpc_task *task, struct rpc_task *child, rpc_action func)
{
+ BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
+ timer_pending(&task->tk_timer));
spin_lock_bh(&childq.lock);
/* N.B. Is it possible for the child to have already finished? */
__rpc_sleep_on(&childq, task, func, NULL);


Attachments:
linux-2.6.12-test_timer.dif (1.52 kB)

2005-05-29 11:23:02

by Oleg Nesterov

Subject: Re: RT and Cascade interrupts

__rpc_execute() calls ->tk_exit and goes to 'restarted' label.
What if ->tk_exit calls rpc_sleep_on() ? If it is possible, we
have a race.

CPU_0 (main loop in rpc_execute restarted) CPU_1

rpc_delete_timer:
if (RPC_IS_QUEUED()) // YES
return;
rpc_run_timer:
rpc_wake_up_task:
clear_bit(RPC_TASK_QUEUED)
preemption, or long interrupt

if (!RPC_IS_QUEUED()) // YES
task->tk_action()
__rpc_add_timer:
set_bit(RPC_TASK_HAS_TIMER)
mod_timer();
clear_bit(RPC_TASK_HAS_TIMER)

Now we have pending timer without RPC_TASK_HAS_TIMER bit set.

Does this patch make any sense?

Oleg.

--- 2.6.12-rc5/net/sunrpc/sched.c~ Sun May 29 15:09:25 2005
+++ 2.6.12-rc5/net/sunrpc/sched.c Sun May 29 15:11:24 2005
@@ -135,8 +135,6 @@ __rpc_add_timer(struct rpc_task *task, r
static void
rpc_delete_timer(struct rpc_task *task)
{
- if (RPC_IS_QUEUED(task))
- return;
if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
del_singleshot_timer_sync(&task->tk_timer);
dprintk("RPC: %4d deleting timer\n", task->tk_pid);
@@ -566,7 +564,6 @@ static int __rpc_execute(struct rpc_task

BUG_ON(RPC_IS_QUEUED(task));

- restarted:
while (1) {
/*
* Garbage collection of pending timers...
@@ -607,6 +604,7 @@ static int __rpc_execute(struct rpc_task
unlock_kernel();
}

+ restarted:
/*
* Lockless check for whether task is sleeping or not.
*/

2005-05-29 13:58:49

by Trond Myklebust

Subject: Re: RT and Cascade interrupts

On Sun 29.05.2005 at 15:31 (+0400), Oleg Nesterov wrote:
> __rpc_execute() calls ->tk_exit and goes to 'restarted' label.
> What if ->tk_exit calls rpc_sleep_on() ? If it is possible, we
> have a race.
>
> CPU_0 (main loop in rpc_execute restarted) CPU_1
>
> rpc_delete_timer:
> if (RPC_IS_QUEUED()) // YES
> return;
> rpc_run_timer:
> rpc_wake_up_task:
> clear_bit(RPC_TASK_QUEUED)
> preemption, or long interrupt
>
> if (!RPC_IS_QUEUED()) // YES
> task->tk_action()
> __rpc_add_timer:
> set_bit(RPC_TASK_HAS_TIMER)
> mod_timer();
> clear_bit(RPC_TASK_HAS_TIMER)
>
> Now we have pending timer without RPC_TASK_HAS_TIMER bit set.

It is possible, but it should be _extremely_ rare. The only cases where
tk_exit() sets a timer are the cases where we call rpc_delay(). In the
existing cases in the kernel, that would be setting timers of 5 seconds
(for the case of JUKEBOX errors) or longer.

These tk_exit()+restart events are very rare in themselves (JUKEBOX
errors only occur if the server needs to do something with a long
latency - such as retrieving a file from an HSM unit), but for that to
coincide with a preemption or interrupt of > 5 seconds inside
__rpc_execute... I'd hate to see those RPC performance figures.

> Does this patch make any sense?

Yes. I agree the scenario is theoretically possible (so I can queue that
patch up for you). I am not convinced it is a plausible explanation for
what John claims to be seeing, though.

Cheers,
Trond

2005-05-30 14:55:00

by Ingo Molnar

Subject: Re: RT and Cascade interrupts


* Trond Myklebust <[email protected]> wrote:

> > Does this patch make any sense?
>
> Yes. I agree the scenario is theoretically possible (so I can queue
> that patch up for you). I am not convinced it is a plausible
> explanation for what John claims to be seeing, though.

i've added this patch (and your debug asserts, except for the
rpc_delete_timer() one) to the -RT tree and i've removed the earlier
hack - perhaps John can re-run the test and see whether it still occurs
under -rc5-RT-V0.7.47-13 or later?

most races are much more likely to occur under PREEMPT_RT than under
other preemption models, but maybe there's something else going on as
well.

Ingo

--- linux/net/sunrpc/sched.c.orig
+++ linux/net/sunrpc/sched.c
@@ -135,8 +135,6 @@ __rpc_add_timer(struct rpc_task *task, r
static void
rpc_delete_timer(struct rpc_task *task)
{
- if (RPC_IS_QUEUED(task))
- return;
if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
del_singleshot_timer_sync(&task->tk_timer);
dprintk("RPC: %4d deleting timer\n", task->tk_pid);
@@ -337,6 +335,8 @@ static void __rpc_sleep_on(struct rpc_wa
void rpc_sleep_on(struct rpc_wait_queue *q, struct rpc_task *task,
rpc_action action, rpc_action timer)
{
+ BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
+ timer_pending(&task->tk_timer));
/*
* Protect the queue operations.
*/
@@ -566,7 +566,6 @@ static int __rpc_execute(struct rpc_task

BUG_ON(RPC_IS_QUEUED(task));

- restarted:
while (1) {
/*
* Garbage collection of pending timers...
@@ -594,6 +593,8 @@ static int __rpc_execute(struct rpc_task
unlock_kernel();
}

+ BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
+ timer_pending(&task->tk_timer));
/*
* Perform the next FSM step.
* tk_action may be NULL when the task has been killed
@@ -607,6 +608,7 @@ static int __rpc_execute(struct rpc_task
unlock_kernel();
}

+ restarted:
/*
* Lockless check for whether task is sleeping or not.
*/
@@ -925,6 +927,8 @@ fail:

void rpc_run_child(struct rpc_task *task, struct rpc_task *child, rpc_action func)
{
+ BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
+ timer_pending(&task->tk_timer));
spin_lock_bh(&childq.lock);
/* N.B. Is it possible for the child to have already finished? */
__rpc_sleep_on(&childq, task, func, NULL);

2005-05-30 21:35:49

by john cooper

Subject: Re: RT and Cascade interrupts

Trond Myklebust wrote:

> I've appended a patch that
> should check for strict compliance of the above rules. Could you try it
> out and see if it triggers any Oopses?

Yes, the assert in rpc_delete_timer() occurs just before
the cascade list corruption. This is consistent with
what I have seen. ie: the timer in a released rpc_task
is still active.

BTW, the patch from Oleg is relative to 2.6.12 and didn't
appear to apply to the 2.6.11-derived base with which I'm
working (the RPC_IS_QUEUED() test at the head of rpc_delete_timer()
does not exist). In any case the relocation of restarted: in
__rpc_execute() did not influence the failure. I'd like to
move to a 2.6.12-based -RT patch however I'm dealing with
"code in the pipe" and unfortunately don't have that option.

Sorry I'm just responding now. We're in the middle of a
long holiday weekend. I will have more time come tomorrow
to analyze this further.

-john


> ------------------------------------------------------------------------
>
> sched.c | 8 ++++++++
> 1 files changed, 8 insertions(+)
>
> Index: linux-2.6.12-rc4/net/sunrpc/sched.c
> ===================================================================
> --- linux-2.6.12-rc4.orig/net/sunrpc/sched.c
> +++ linux-2.6.12-rc4/net/sunrpc/sched.c
> @@ -135,6 +135,8 @@ __rpc_add_timer(struct rpc_task *task, r
> static void
> rpc_delete_timer(struct rpc_task *task)
> {
> + BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) == 0 &&
> + timer_pending(&task->tk_timer));
> if (RPC_IS_QUEUED(task))
> return;
> if (test_and_clear_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate)) {
> @@ -337,6 +339,8 @@ static void __rpc_sleep_on(struct rpc_wa
> void rpc_sleep_on(struct rpc_wait_queue *q, struct rpc_task *task,
> rpc_action action, rpc_action timer)
> {
> + BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
> + timer_pending(&task->tk_timer));
> /*
> * Protect the queue operations.
> */
> @@ -594,6 +598,8 @@ static int __rpc_execute(struct rpc_task
> unlock_kernel();
> }
>
> + BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
> + timer_pending(&task->tk_timer));
> /*
> * Perform the next FSM step.
> * tk_action may be NULL when the task has been killed
> @@ -925,6 +931,8 @@ fail:
>
> void rpc_run_child(struct rpc_task *task, struct rpc_task *child, rpc_action func)
> {
> + BUG_ON(test_bit(RPC_TASK_HAS_TIMER, &task->tk_runstate) != 0 ||
> + timer_pending(&task->tk_timer));
> spin_lock_bh(&childq.lock);
> /* N.B. Is it possible for the child to have already finished? */
> __rpc_sleep_on(&childq, task, func, NULL);


--
[email protected]

2005-05-31 23:11:42

by john cooper

Subject: Re: RT and Cascade interrupts

john cooper wrote:
> Trond Myklebust wrote:
>
>> I've appended a patch that
>> should check for strict compliance of the above rules. Could you try it
>> out and see if it triggers any Oopses?
>
>
> Yes, the assert in rpc_delete_timer() occurs just before
> the cascade list corruption. This is consistent with
> what I have seen. ie: the timer in a released rpc_task
> is still active.

I've captured more data in the instrumentation and found
the rpc_task's timer is being requeued by an application
task which is preempting ksoftirqd when it wakes up in
xprt_transmit(). This is what I had originally suspected
but likely didn't communicate it effectively.

The scenario unfolds as:

[high priority app task]
:
call_transmit()
xprt_transmit()
/* blocks in xprt_transmit() */
ksoftirqd()
__run_timers()
list_del("rpc_task_X.timer") /* logically off cascade */
rpc_run_timer(data)
task->tk_timeout_fn(task)

/* ksoftirqd preempted */

:
---------------------------------------------------------
/* Don't race with disconnect */
if (!xprt_connected(xprt))
task->tk_status = -ENOTCONN;
else if (!req->rq_received)
rpc_sleep_on(&xprt->pending, task, NULL, xprt_timer);
---------------------------------------------------------
__rpc_sleep_on()
__mod_timer("rpc_task_X.timer") /* requeued in cascade */
/* blocks */

/* rpc_run_timer resumes from preempt */
clear_bit(RPC_TASK_HAS_TIMER, "rpc_task_X.tk_runstate");

/* rpc_task_X.timer is now enqueued in cascade without
RPC_TASK_HAS_TIMER set and will not be dequeued
in rpc_release_task()/rpc_delete_timer() */


The usage of "rpc_task_X.timer" indicates the same KVA
observed for the timer struct at the associated points
in the instrumented code.

The above was gathered by logging usage of the
kernel/timer.c primitives. Thus I don't have more
detailed state of the rpc_task in RPC context.
However I did verify which of the three calls to
rpc_sleep_on() in xprt_transmit() was being invoked
(as above).

So the root cause appears to be the rpc_task's timer
being requeued in xprt_transmit() when rpc_run_timer
is preempted. From looking at the code I'm unsure
if modifying xprt_transmit()/out_receive is appropriate
to synchronize with rpc_release_task(). It seems
allowing rpc_sleep_on() to occur is more natural and
for rpc_release_task() to detect the pending timer and
remove it before proceeding. I'm still in the process
of trying to digest the logic here but I thought there
was enough information to be of use. Suggestions,
warnings welcome.

-john


--
[email protected]