2003-02-07 12:21:49

by Jakob Oestergaard

[permalink] [raw]
Subject: Race in RPC code


Hello all,

I think there is a race in the RPC code in 2.4.20, related to timeout
and congestion window handling.

The problem code is in net/sunrpc/xprt.c:

static void
xprt_timer(struct rpc_task *task)
{
struct rpc_rqst *req = task->tk_rqstp;
struct rpc_xprt *xprt = req->rq_xprt;

spin_lock(&xprt->sock_lock);
if (req->rq_received)
goto out;

if (!xprt->nocong) {
if (xprt_expbackoff(task, req)) {
rpc_add_timer(task, xprt_timer);
goto out_unlock;
}
rpc_inc_timeo(&task->tk_client->cl_rtt);
xprt_adjust_cwnd(req->rq_xprt, -ETIMEDOUT);
}
req->rq_nresend++;

The call to xprt_adjust_cwnd is the problem - I experienced a panic
(null pointer dereference) in

static void
xprt_adjust_cwnd(struct rpc_xprt *xprt, int result)
{
unsigned long cwnd;

cwnd = xprt->cwnd;
if (result >= 0 && cwnd <= xprt->cong) {

Here it is the "cwnd = xprt->cwnd" that causes the panic. xprt was 0.

This means, in the xprt_timer code, that req->rq_xprt must have been 0.

I guess this can happen because of the sequence:

xprt = req->rq_xprt;
spin_lock(&xprt->sock_lock);
...
xprt_adjust_cwnd(req->rq_xprt);

We don't know whether req has been modified between the assignment and
the spin_lock.

Attached is a patch to solve the problem - please comment.

It does not solve the other potential (?) problem with:

xprt = req->rq_xprt;
spin_lock(&xprt->sock_lock);
...
if (xprt_expbackoff(task, req)) {
...

Any suggestions to that one?

I cannot test whether my patch solve the problem, because this panic has
happened once on a *heavily* loaded box which has run 2.4.20 ever since
it came out. The race is extremely rare. It is an SMP box by the way.

Thanks a lot to "baldrick" on kernel-newbies for the help!

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob ?stergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:


Attachments:
(No filename) (2.27 kB)
race-diff (335.00 B)
Download all attachments

2003-02-07 13:12:28

by Trond Myklebust

[permalink] [raw]
Subject: Race in RPC code

>>>>> " " == Jakob Oestergaard <[email protected]> writes:


> We don't know whether req has been modified between the
> assignment and the spin_lock.

It had better not be. If it is, then I want to know where so that we
can fix it.

req->rq_xprt is set up when the request is initialized. It
is not meant to change until the request gets released. This again
should not happen while the request is still on the wait queue.

IOW the fix you propose would just be papering over another problem.

Cheers,
Trond

2003-02-07 13:35:09

by Jakob Oestergaard

[permalink] [raw]
Subject: Re: Race in RPC code

On Fri, Feb 07, 2003 at 02:21:50PM +0100, Trond Myklebust wrote:
> >>>>> " " == Jakob Oestergaard <[email protected]> writes:
>
>
> > We don't know whether req has been modified between the
> > assignment and the spin_lock.
>
> It had better not be. If it is, then I want to know where so that we
> can fix it.
>
> req->rq_xprt is set up when the request is initialized. It
> is not meant to change until the request gets released. This again
> should not happen while the request is still on the wait queue.
>
> IOW the fix you propose would just be papering over another problem.

Any suggestions as to how it could happen?

The box is running huge compile jobs (>100MB memory used by each
compiler - runs 2-3 compilers concurrently) all day long - we never had
a GCC sig11 error. It has 512 MB of ECC memory (and ECC is enabled) - I
seriously doubt that we have a memory corruption problem.

The panic has happened once, just today.

I will be happy to try other solutions, but I can't verify whether they
work - I mean, if the box runs another few months without crashing that
doesn't really prove anything...

Thanks for commenting!

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob ?stergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2003-02-07 18:08:34

by Ion Badulescu

[permalink] [raw]
Subject: Re: Race in RPC code

Hi Jakob, Trond,

On Fri, 7 Feb 2003 14:44:46 +0100, Jakob Oestergaard <[email protected]> wrote:

> The panic has happened once, just today.

I've seen this multiple times, and even reported it to the nfs mailing list.

I'm just glad someone else is seeing the same kind of oops with a vanilla
kernel, because I was seeing it with a heavily patched
redhat+nfs2.4.20+nfsall2.4.20 kernel and wasn't sure if that had anything
to do with it.

I have at least 5-6 such oopsen recorded, if anyone cares.

Ion

--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.

2003-02-07 19:41:51

by Duncan Sands

[permalink] [raw]
Subject: Re: Race in RPC code

On Friday 07 February 2003 19:18, Ion Badulescu wrote:
> Hi Jakob, Trond,
>
> On Fri, 7 Feb 2003 14:44:46 +0100, Jakob Oestergaard <[email protected]> wrote:
> > The panic has happened once, just today.
>
> I've seen this multiple times, and even reported it to the nfs mailing
> list.
>
> I'm just glad someone else is seeing the same kind of oops with a vanilla
> kernel, because I was seeing it with a heavily patched
> redhat+nfs2.4.20+nfsall2.4.20 kernel and wasn't sure if that had anything
> to do with it.
>
> I have at least 5-6 such oopsen recorded, if anyone cares.

Please send them.

Duncan.