2007-03-05 13:47:10

by Jan Rękorajski

[permalink] [raw]
Subject: Oops in lockd (2.6.20.1)

Hi,
I have NFS server, serving /, /home and /usr to some 100 workstations,
and the following oops bit me today in the morning.

yellow11 is a workstation (probably someone rebooted it with a power
switch).
Server is 4xDualCore Opteron with 32GB RAM and 500GB of exported
filesystems .

This kernel is heavily patched, but nothing AFAIK touches NFS or
anything near it.

lockd: server yellow11 not responding, timed out
lockd: couldn't create RPC handle for yellow11
lockd: server yellow11 not responding, timed out
lockd: couldn't create RPC handle for yellow11
general protection fault: 0000 [1] SMP
CPU 0
Modules linked in: vroot nfsd exportfs lockd nfs_acl sunrpc tg3 ext2 mbcache dm_mod 8021q
Pid: 4114:#0, comm: lockd Not tainted 2.6.20.1 #3
RIP: 0010:[<ffffffff8807e058>] [<ffffffff8807e058>] :lockd:nlm_release_host+0x28/0x110
RSP: 0018:ffff8107c5075e40 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0008000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000086 RDI: 0008000000000000
RBP: ffff81077ce4d080 R08: 0000000000000000 R09: 0000000000000000
R10: 2222222222222222 R11: 2222222222222222 R12: 00000000ffffffdb
R13: 000000000000000a R14: ffff8107c5075ea0 R15: ffffffff880849c0
FS: 00002b5d76afeb00(0000) GS:ffffffff805e0000(0000) knlGS:00000000f7b5a6c0
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000f7ee0ba9 CR3: 00000005d50db000 CR4: 00000000000006e0
Process lockd (pid: 4114[#0], threadinfo ffff8107c5074000, task ffff8107d8e1c100)
Stack: ffff8107b75ab000 ffffffff8807cc1d ffff8107b75ab000 ffffffff8807cfca
0000000000000000 ffff8107b75ab038 ffff8107da3132c0 ffff8107b75ab000
ffff8106c24bd4c0 ffffffff8807e7a0 0000000000000000 ffffffff8807d032
Call Trace:
[<ffffffff8807cc1d>] :lockd:nlm_release_call+0xd/0x20
[<ffffffff8807cfca>] :lockd:__nlm_async_call+0x9a/0xc0
[<ffffffff8807e7a0>] :lockd:lockd+0x0/0x280
[<ffffffff8807d032>] :lockd:nlm_async_call+0x42/0x50
[<ffffffff8807fbae>] :lockd:nlmsvc_grant_blocked+0x12e/0x170
[<ffffffff8807fdd3>] :lockd:nlmsvc_retry_blocked+0x73/0xa0
[<ffffffff8807e8d7>] :lockd:lockd+0x137/0x280
[<ffffffff80253df8>] child_rip+0xa/0x12
[<ffffffff8807e7a0>] :lockd:lockd+0x0/0x280
[<ffffffff8807e7a0>] :lockd:lockd+0x0/0x280
[<ffffffff80253dee>] child_rip+0x0/0x12


Code: 8b 73 74 85 f6 79 25 48 c7 c1 60 47 08 88 ba 24 01 00 00 48
RIP [<ffffffff8807e058>] :lockd:nlm_release_host+0x28/0x110
RSP <ffff8107c5075e40>

--
Jan Rekorajski | ALL SUSPECTS ARE GUILTY. PERIOD!
baggins<at>mimuw.edu.pl | OTHERWISE THEY WOULDN'T BE SUSPECTS, WOULD THEY?
BOFH, MANIAC | -- TROOPS by Kevin Rubio

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2007-03-05 15:37:10

by Olaf Kirch

[permalink] [raw]
Subject: Re: Oops in lockd (2.6.20.1)

On Monday 05 March 2007 14:47, Jan Rekorajski wrote:
> [<ffffffff8807cc1d>] :lockd:nlm_release_call+0xd/0x20
> [<ffffffff8807cfca>] :lockd:__nlm_async_call+0x9a/0xc0
> [<ffffffff8807e7a0>] :lockd:lockd+0x0/0x280
> [<ffffffff8807d032>] :lockd:nlm_async_call+0x42/0x50
> [<ffffffff8807fbae>] :lockd:nlmsvc_grant_blocked+0x12e/0x170
> [<ffffffff8807fdd3>] :lockd:nlmsvc_retry_blocked+0x73/0xa0
> [<ffffffff8807e8d7>] :lockd:lockd+0x137/0x280
> [<ffffffff80253df8>] child_rip+0xa/0x12
> [<ffffffff8807e7a0>] :lockd:lockd+0x0/0x280
> [<ffffffff8807e7a0>] :lockd:lockd+0x0/0x280
> [<ffffffff80253dee>] child_rip+0x0/0x12
>
>
> Code: 8b 73 74 85 f6 79 25 48 c7 c1 60 47 08 88 ba 24 01 00 00 48
> RIP [<ffffffff8807e058>] :lockd:nlm_release_host+0x28/0x110
> RSP <ffff8107c5075e40>

It seems it's dying on a bogus nlm_host pointer. In __nlm_async_call we have:

status = rpc_call_async(clnt, msg, RPC_TASK_ASYNC, tk_ops, req);
if (status == 0)
return 0;
out_err:
nlm_release_call(req);
return status;
}

So we ended up in nlm_release_call because rpc_call_async returned an
error. However, when rpc_call_async fails, it calls the rpc_call_done handler.
In this case, it's nlmsvc_release_block, which ends up freeing the nlm_block
object, which does nlm_release_call(block->b_call).

So it appears to me that it's a double free.

IMHO, when rpc_call_async fails, we should just return status right away.

A similar argument applies to nlmsvc_grant_blocked, where we try to
release the nlm_block object if nlm_async_call() returns an error. When
we get there, the block object will have been freed already.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2007-03-05 16:18:53

by Trond Myklebust

[permalink] [raw]
Subject: Re: Oops in lockd (2.6.20.1)

On Mon, 2007-03-05 at 16:36 +0100, Olaf Kirch wrote:
> On Monday 05 March 2007 14:47, Jan Rekorajski wrote:
> > [<ffffffff8807cc1d>] :lockd:nlm_release_call+0xd/0x20
> > [<ffffffff8807cfca>] :lockd:__nlm_async_call+0x9a/0xc0
> > [<ffffffff8807e7a0>] :lockd:lockd+0x0/0x280
> > [<ffffffff8807d032>] :lockd:nlm_async_call+0x42/0x50
> > [<ffffffff8807fbae>] :lockd:nlmsvc_grant_blocked+0x12e/0x170
> > [<ffffffff8807fdd3>] :lockd:nlmsvc_retry_blocked+0x73/0xa0
> > [<ffffffff8807e8d7>] :lockd:lockd+0x137/0x280
> > [<ffffffff80253df8>] child_rip+0xa/0x12
> > [<ffffffff8807e7a0>] :lockd:lockd+0x0/0x280
> > [<ffffffff8807e7a0>] :lockd:lockd+0x0/0x280
> > [<ffffffff80253dee>] child_rip+0x0/0x12
> >
> >
> > Code: 8b 73 74 85 f6 79 25 48 c7 c1 60 47 08 88 ba 24 01 00 00 48
> > RIP [<ffffffff8807e058>] :lockd:nlm_release_host+0x28/0x110
> > RSP <ffff8107c5075e40>
>
> It seems it's dying on a bogus nlm_host pointer. In __nlm_async_call we have:
>
> status = rpc_call_async(clnt, msg, RPC_TASK_ASYNC, tk_ops, req);
> if (status == 0)
> return 0;
> out_err:
> nlm_release_call(req);
> return status;
> }
>
> So we ended up in nlm_release_call because rpc_call_async returned an
> error. However, when rpc_call_async fails, it calls the rpc_call_done handler.
> In this case, it's nlmsvc_release_block, which ends up freeing the nlm_block
> object, which does nlm_release_call(block->b_call).
>
> So it appears to me that it's a double free.
>
> IMHO, when rpc_call_async fails, we should just return status right away.
>
> A similar argument applies to nlmsvc_grant_blocked, where we try to
> release the nlm_block object if nlm_async_call() returns an error. When
> we get there, the block object will have been freed already.
>
> Olaf

Already fixed in 2.6.21-rc1 mainline. See

http://client.linux-nfs.org/Linux-2.6.x/2.6.20/linux-2.6.20-001-fix_nlm_async_call.dif

Cheers,
Trond


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2007-03-05 17:51:51

by Olaf Kirch

[permalink] [raw]
Subject: Re: Oops in lockd (2.6.20.1)

On Monday 05 March 2007 17:18, Trond Myklebust wrote:
> Already fixed in 2.6.21-rc1 mainline.

Nice :)

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2007-03-05 17:54:38

by Jan Rękorajski

[permalink] [raw]
Subject: Re: Oops in lockd (2.6.20.1)

On Mon, 05 Mar 2007, Trond Myklebust wrote:

> Already fixed in 2.6.21-rc1 mainline. See
>
> http://client.linux-nfs.org/Linux-2.6.x/2.6.20/linux-2.6.20-001-fix_nlm_async_call.dif

Thank you :)
I'll test it tomorrow and I really hope my problems with
unresponsive/dying lockd will go away at last.

Jan
--
Jan Rekorajski | ALL SUSPECTS ARE GUILTY. PERIOD!
baggins<at>mimuw.edu.pl | OTHERWISE THEY WOULDN'T BE SUSPECTS, WOULD THEY?
BOFH, MANIAC | -- TROOPS by Kevin Rubio

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs