2005-12-16 20:55:38

by Ryan Richter

[permalink] [raw]
Subject: lockd: couldn't create RPC handle for (host)

Hi, nfs locking stopped working on my file server running 2.6.15-rc5
today. All clients that try locking operations hang, and I get the
message from the server:

lockd: couldn't create RPC handle for w.x.y.z

Also, the lockd process is unkillable, it looks like I'll have to
reboot.

The nfs mounts look like:

rw,nosuid,nodev,v3,rsize=8192,wsize=8192,hard,intr,udp,lock

I've recently posted the kernel .config and dmesg for both the server
and clients for other bug reports. For the server:

Message-ID: <[email protected]>

and the clients:

Message-ID: <[email protected]>

Any help would be appreciated.

Thanks,
-ryan


2005-12-16 23:53:11

by Trond Myklebust

[permalink] [raw]
Subject: Re: lockd: couldn't create RPC handle for (host)

On Fri, 2005-12-16 at 15:55 -0500, Ryan Richter wrote:
> Hi, nfs locking stopped working on my file server running 2.6.15-rc5
> today. All clients that try locking operations hang, and I get the
> message from the server:
>
> lockd: couldn't create RPC handle for w.x.y.z

Points either to a client which is not responding to callbacks, or an
OOM situation on the server.

Does 'rpcinfo -u w.x.y.z 100021' work from the server?

> Also, the lockd process is unkillable, it looks like I'll have to
> reboot.

lockd cannot be killed as long as you have active nfsd threads or active
nfs client partitions. That is by design.

Cheers,
Trond

2005-12-16 23:58:47

by Ryan Richter

[permalink] [raw]
Subject: Re: lockd: couldn't create RPC handle for (host)

On Fri, Dec 16, 2005 at 06:49:05PM -0500, Trond Myklebust wrote:
> On Fri, 2005-12-16 at 15:55 -0500, Ryan Richter wrote:
> > Hi, nfs locking stopped working on my file server running 2.6.15-rc5
> > today. All clients that try locking operations hang, and I get the
> > message from the server:
> >
> > lockd: couldn't create RPC handle for w.x.y.z
>
> Points either to a client which is not responding to callbacks, or an
> OOM situation on the server.
>
> Does 'rpcinfo -u w.x.y.z 100021' work from the server?

No.

$ rpcinfo -u jacquere 100021
rpcinfo: RPC: Timed out
program 100021 version 0 is not available
zsh: exit 1 rpcinfo -u jacquere 100021

So I see now lockd is not present on the client. But...

$ rpcinfo -p jacquere
program vers proto port
100000 2 tcp 111 portmapper
100000 2 udp 111 portmapper
100021 1 udp 32768 nlockmgr
100021 3 udp 32768 nlockmgr
100021 4 udp 32768 nlockmgr
100024 1 udp 867 status
100024 1 tcp 870 status

So what does that mean?

> > Also, the lockd process is unkillable, it looks like I'll have to
> > reboot.
>
> lockd cannot be killed as long as you have active nfsd threads or active
> nfs client partitions. That is by design.

OK, that's good to know.

Thanks,
-ryan

2005-12-17 05:33:20

by Trond Myklebust

[permalink] [raw]
Subject: Re: lockd: couldn't create RPC handle for (host)

On Fri, 2005-12-16 at 18:58 -0500, Ryan Richter wrote:
> On Fri, Dec 16, 2005 at 06:49:05PM -0500, Trond Myklebust wrote:
> > On Fri, 2005-12-16 at 15:55 -0500, Ryan Richter wrote:
> > > Hi, nfs locking stopped working on my file server running 2.6.15-rc5
> > > today. All clients that try locking operations hang, and I get the
> > > message from the server:
> > >
> > > lockd: couldn't create RPC handle for w.x.y.z
> >
> > Points either to a client which is not responding to callbacks, or an
> > OOM situation on the server.
> >
> > Does 'rpcinfo -u w.x.y.z 100021' work from the server?
>
> No.
>
> $ rpcinfo -u jacquere 100021
> rpcinfo: RPC: Timed out
> program 100021 version 0 is not available
> zsh: exit 1 rpcinfo -u jacquere 100021
>
> So I see now lockd is not present on the client. But...
>
> $ rpcinfo -p jacquere
> program vers proto port
> 100000 2 tcp 111 portmapper
> 100000 2 udp 111 portmapper
> 100021 1 udp 32768 nlockmgr
> 100021 3 udp 32768 nlockmgr
> 100021 4 udp 32768 nlockmgr
> 100024 1 udp 867 status
> 100024 1 tcp 870 status
>
> So what does that mean?

Looks to me as if port 111 is pingable (since you can talk to the
portmapper), but port 32768 is not. Are you using port filtering or
firewalling anywhere (on the client, server, or switches)?

Cheers,
Trond

2005-12-17 05:59:13

by Ryan Richter

[permalink] [raw]
Subject: Re: lockd: couldn't create RPC handle for (host)

On Sat, Dec 17, 2005 at 12:32:57AM -0500, Trond Myklebust wrote:
> On Fri, 2005-12-16 at 18:58 -0500, Ryan Richter wrote:
> > On Fri, Dec 16, 2005 at 06:49:05PM -0500, Trond Myklebust wrote:
> > > On Fri, 2005-12-16 at 15:55 -0500, Ryan Richter wrote:
> > > > Hi, nfs locking stopped working on my file server running 2.6.15-rc5
> > > > today. All clients that try locking operations hang, and I get the
> > > > message from the server:
> > > >
> > > > lockd: couldn't create RPC handle for w.x.y.z
> > >
> > > Points either to a client which is not responding to callbacks, or an
> > > OOM situation on the server.
> > >
> > > Does 'rpcinfo -u w.x.y.z 100021' work from the server?
> >
> > No.
> >
> > $ rpcinfo -u jacquere 100021
> > rpcinfo: RPC: Timed out
> > program 100021 version 0 is not available
> > zsh: exit 1 rpcinfo -u jacquere 100021
> >
> > So I see now lockd is not present on the client. But...
> >
> > $ rpcinfo -p jacquere
> > program vers proto port
> > 100000 2 tcp 111 portmapper
> > 100000 2 udp 111 portmapper
> > 100021 1 udp 32768 nlockmgr
> > 100021 3 udp 32768 nlockmgr
> > 100021 4 udp 32768 nlockmgr
> > 100024 1 udp 867 status
> > 100024 1 tcp 870 status
> >
> > So what does that mean?
>
> Looks to me as if port 111 is pingable (since you can talk to the
> portmapper), but port 32768 is not. Are you using port filtering or
> firewalling anywhere (on the client, server, or switches)?

There's no filtering between the two. I get this on the machine itself:
$ rpcinfo -u localhost 100021
rpcinfo: RPC: Timed out
program 100021 version 0 is not available
zsh: exit 1 rpcinfo -u localhost 100021

There's no lockd process running on this client machine anymore.

On the server:

$ rpcinfo -u localhost 100021
program 100021 version 1 ready and waiting
rpcinfo: RPC: Program/version mismatch; low version = 1, high version =4
program 100021 version 2 is not available
program 100021 version 3 ready and waiting
program 100021 version 4 ready and waiting
zsh: exit 1 rpcinfo -u localhost 100021

Also neither machine is anywhere close to OOM.

-ryan

2005-12-17 06:43:58

by Trond Myklebust

[permalink] [raw]
Subject: Re: lockd: couldn't create RPC handle for (host)

On Sat, 2005-12-17 at 00:59 -0500, Ryan Richter wrote:

> There's no filtering between the two. I get this on the machine itself:
> $ rpcinfo -u localhost 100021
> rpcinfo: RPC: Timed out
> program 100021 version 0 is not available
> zsh: exit 1 rpcinfo -u localhost 100021
>
> There's no lockd process running on this client machine anymore.

...yet the client still has the partition mounted, and isn't using the
-onolock option? That sounds weird.

Is the client also running 2.6.15-rc5?

Cheers,
Trond

2005-12-17 07:02:27

by Ryan Richter

[permalink] [raw]
Subject: Re: lockd: couldn't create RPC handle for (host)

On Sat, Dec 17, 2005 at 01:43:42AM -0500, Trond Myklebust wrote:
> On Sat, 2005-12-17 at 00:59 -0500, Ryan Richter wrote:
>
> > There's no filtering between the two. I get this on the machine itself:
> > $ rpcinfo -u localhost 100021
> > rpcinfo: RPC: Timed out
> > program 100021 version 0 is not available
> > zsh: exit 1 rpcinfo -u localhost 100021
> >
> > There's no lockd process running on this client machine anymore.
>
> ...yet the client still has the partition mounted, and isn't using the
> -onolock option? That sounds weird.

There are lots of nfs mounts, the root is ro nolock, but the trouble is
with the home directories which are rw lock. Everything is still
mounted, I can ssh in fine etc. The problem is with people using kde -
it tries to lock some file in the home directory during the login
process and hangs.

> Is the client also running 2.6.15-rc5?

No, 2.6.14.2.

-ryan

2005-12-17 19:28:36

by Trond Myklebust

[permalink] [raw]
Subject: Re: lockd: couldn't create RPC handle for (host)

On Sat, 2005-12-17 at 02:02 -0500, Ryan Richter wrote:
> > > There's no filtering between the two. I get this on the machine itself:
> > > $ rpcinfo -u localhost 100021
> > > rpcinfo: RPC: Timed out
> > > program 100021 version 0 is not available
> > > zsh: exit 1 rpcinfo -u localhost 100021
> > >
> > > There's no lockd process running on this client machine anymore.
> >
> > ...yet the client still has the partition mounted, and isn't using the
> > -onolock option? That sounds weird.
>
> There are lots of nfs mounts, the root is ro nolock, but the trouble is
> with the home directories which are rw lock. Everything is still
> mounted, I can ssh in fine etc. The problem is with people using kde -
> it tries to lock some file in the home directory during the login
> process and hangs.

So what do you mean when you say that there is "no lockd process
running" on the client?

Is it not appearing in the output of 'ps -ef' either?

Is anything at all listening on port 32768 on 'jacquere'? (check using
'netstat -ap | grep 32768').

Could anything be playing around with the 'pmap_set' utility so as to
corrupt portmap?

Are you perhaps setting /proc/sys/fs/nfs/nlm_udpport and/or the kernel
parameter nfs.nlm_udpport? If so, to what?

Cheers,
Trond

2005-12-17 20:50:04

by Ryan Richter

[permalink] [raw]
Subject: Re: lockd: couldn't create RPC handle for (host)

On Sat, Dec 17, 2005 at 02:28:19PM -0500, Trond Myklebust wrote:
> On Sat, 2005-12-17 at 02:02 -0500, Ryan Richter wrote:
> > > > There's no filtering between the two. I get this on the machine itself:
> > > > $ rpcinfo -u localhost 100021
> > > > rpcinfo: RPC: Timed out
> > > > program 100021 version 0 is not available
> > > > zsh: exit 1 rpcinfo -u localhost 100021
> > > >
> > > > There's no lockd process running on this client machine anymore.
> > >
> > > ...yet the client still has the partition mounted, and isn't using the
> > > -onolock option? That sounds weird.
> >
> > There are lots of nfs mounts, the root is ro nolock, but the trouble is
> > with the home directories which are rw lock. Everything is still
> > mounted, I can ssh in fine etc. The problem is with people using kde -
> > it tries to lock some file in the home directory during the login
> > process and hangs.
>
> So what do you mean when you say that there is "no lockd process
> running" on the client?
>
> Is it not appearing in the output of 'ps -ef' either?

Nope.

$ ps -ef|grep lock
root 77 5 0 Nov20 ? 00:00:00 [kblockd/0]
foo 6811 6800 0 14:29 pts/0 00:00:00 grep -E lock

> Is anything at all listening on port 32768 on 'jacquere'? (check using
> 'netstat -ap | grep 32768').

Er... sort of?

# netstat -ap | grep 32768
udp 11144 0 *:32768 *:* -
I'm not sure what that means... lsof|grep 32768 returns nothing.

> Could anything be playing around with the 'pmap_set' utility so as to
> corrupt portmap?

Not that I know of.

> Are you perhaps setting /proc/sys/fs/nfs/nlm_udpport and/or the kernel
> parameter nfs.nlm_udpport? If so, to what?

Nope, that file contains a 0. I can't think of anything that would be
changing parameters like this.

There are several identical workstations like this, and all the ones
that are used regularly have lost their lockd. There's one that's
located in an inconvenient place and nobody ever uses it - this one
still has lockd running.

Thanks,
-ryan

2005-12-18 08:35:05

by Trond Myklebust

[permalink] [raw]
Subject: Re: lockd: couldn't create RPC handle for (host)

On Sat, 2005-12-17 at 14:45 -0500, Ryan Richter wrote:
> > So what do you mean when you say that there is "no lockd process
> > running" on the client?
> >
> > Is it not appearing in the output of 'ps -ef' either?
>
> Nope.
>
> $ ps -ef|grep lock
> root 77 5 0 Nov20 ? 00:00:00 [kblockd/0]
> foo 6811 6800 0 14:29 pts/0 00:00:00 grep -E lock

Any Oopses? (use 'dmesg')

Could you also check dmesg for any entries of the form

'lockd: new process, skipping host shutdown'
or
'lockd_up: makesock failed, error='

> > Is anything at all listening on port 32768 on 'jacquere'? (check using
> > 'netstat -ap | grep 32768').
>
> Er... sort of?
>
> # netstat -ap | grep 32768
> udp 11144 0 *:32768 *:* -
> I'm not sure what that means... lsof|grep 32768 returns nothing.

That could be a kernel process. The RPC client has no reason to set up a
full file descriptor for the socket.

Could you please finally double-check that the entries in /proc/mounts
for your NFS mounts do contain the 'lock' mount option.

Finally, please do

echo 1 > /proc/sys/sunrpc/rpc_lockd
then unmount one of your NFS partitions, and then mount it again.


Cheers,
Trond

2005-12-18 18:59:22

by Ryan Richter

[permalink] [raw]
Subject: Re: lockd: couldn't create RPC handle for (host)

On Sun, Dec 18, 2005 at 03:33:56AM -0500, Trond Myklebust wrote:
> Any Oopses? (use 'dmesg')
>
> Could you also check dmesg for any entries of the form
>
> 'lockd: new process, skipping host shutdown'
> or
> 'lockd_up: makesock failed, error='

Well, there are the oopses I reported last week or so:

Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
<ffffffff801dbd9e>{nlmclnt_mark_reclaim+62}
PGD 7e0e7067 PUD 7e6c0067 PMD 0
Oops: 0000 [1]
CPU 0
Modules linked in:
Pid: 1316, comm: lockd Not tainted 2.6.14.2 #2
RIP: 0010:[<ffffffff801dbd9e>] <ffffffff801dbd9e>{nlmclnt_mark_reclaim+62}
RSP: 0018:ffff81007dfade70 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff81007ad74740 RCX: ffff81007e24a858
RDX: ffff81007e24a8f0 RSI: ffff81007e24a8e8 RDI: ffff81007ad74740
RBP: ffff81007e820e00 R08: 00000000fffffffa R09: 0000000000000001
R10: 00000000ffffffff R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffffffff803ec420 R15: ffff81007df62014
FS: 00002aaaab00c4a0(0000) GS:ffffffff804b6800(0000) knlGS:00000000557a9080
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000018 CR3: 000000007e113000 CR4: 00000000000006e0
Process lockd (pid: 1316, threadinfo ffff81007dfac000, task ffff81007eea61c0)
Stack: ffffffff801dbe6b ffff81007ad74740 ffffffff801e3d8c 3256cc84d3030002
0000000000000000 ffff81007df4fc68 ffff81007df4fc00 ffffffff803ed4a0
ffff81007df4fca0 ffff81007df4fc68
Call Trace:<ffffffff801dbe6b>{nlmclnt_recovery+139} <ffffffff801e3d8c>{nlm4svc_proc_sm_notify+188}
<ffffffff8034c5a4>{svc_process+884} <ffffffff8012ab40>{default_wake_function+0}
<ffffffff801dde00>{lockd+352} <ffffffff801ddca0>{lockd+0}
<ffffffff8010e352>{child_rip+8} <ffffffff801ddca0>{lockd+0}
<ffffffff801ddca0>{lockd+0} <ffffffff8010e34a>{child_rip+0}


Code: 48 39 78 18 75 1c 8b 86 8c 00 00 00 a8 01 74 12 83 c8 02 89
RIP <ffffffff801dbd9e>{nlmclnt_mark_reclaim+62} RSP <ffff81007dfade70>
CR2: 0000000000000018

Every machine with a dead lockd has had this oops. Other stuff that
looks related (these came after the oops, a few days later):

lockd: unexpected unlock status: 1
lockd: weird return 7 for CANCEL call

Neither of those two messages you mentioned are present.

> > > Is anything at all listening on port 32768 on 'jacquere'? (check using
> > > 'netstat -ap | grep 32768').
> >
> > Er... sort of?
> >
> > # netstat -ap | grep 32768
> > udp 11144 0 *:32768 *:* -
> > I'm not sure what that means... lsof|grep 32768 returns nothing.
>
> That could be a kernel process. The RPC client has no reason to set up a
> full file descriptor for the socket.
>
> Could you please finally double-check that the entries in /proc/mounts
> for your NFS mounts do contain the 'lock' mount option.

Yep:

xarello:/export/home /export/home nfs rw,nosuid,nodev,v3,rsize=8192,wsize=8192,hard,intr,udp,lock,addr=xarello 0 0

> Finally, please do
>
> echo 1 > /proc/sys/sunrpc/rpc_lockd
> then unmount one of your NFS partitions, and then mount it again.

That file doesn't exist.

$ ls /proc/sys/sunrpc
nfs_debug nfsd_debug nlm_debug rpc_debug tcp_slot_table_entries udp_slot_table_entries

Thanks,
-ryan

2005-12-18 19:31:25

by Trond Myklebust

[permalink] [raw]
Subject: Re: lockd: couldn't create RPC handle for (host)

On Sun, 2005-12-18 at 13:01 -0500, Ryan Richter wrote:
> On Sun, Dec 18, 2005 at 03:33:56AM -0500, Trond Myklebust wrote:
> > Any Oopses? (use 'dmesg')
> >
> > Could you also check dmesg for any entries of the form
> >
> > 'lockd: new process, skipping host shutdown'
> > or
> > 'lockd_up: makesock failed, error='
>
> Well, there are the oopses I reported last week or so:
>
> Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
> <ffffffff801dbd9e>{nlmclnt_mark_reclaim+62}
> PGD 7e0e7067 PUD 7e6c0067 PMD 0
> Oops: 0000 [1]
> CPU 0
> Modules linked in:
> Pid: 1316, comm: lockd Not tainted 2.6.14.2 #2
> RIP: 0010:[<ffffffff801dbd9e>] <ffffffff801dbd9e>{nlmclnt_mark_reclaim+62}
> RSP: 0018:ffff81007dfade70 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff81007ad74740 RCX: ffff81007e24a858
> RDX: ffff81007e24a8f0 RSI: ffff81007e24a8e8 RDI: ffff81007ad74740
> RBP: ffff81007e820e00 R08: 00000000fffffffa R09: 0000000000000001
> R10: 00000000ffffffff R11: 0000000000000000 R12: 0000000000000000
> R13: 0000000000000000 R14: ffffffff803ec420 R15: ffff81007df62014
> FS: 00002aaaab00c4a0(0000) GS:ffffffff804b6800(0000) knlGS:00000000557a9080
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000018 CR3: 000000007e113000 CR4: 00000000000006e0
> Process lockd (pid: 1316, threadinfo ffff81007dfac000, task ffff81007eea61c0)
> Stack: ffffffff801dbe6b ffff81007ad74740 ffffffff801e3d8c 3256cc84d3030002
> 0000000000000000 ffff81007df4fc68 ffff81007df4fc00 ffffffff803ed4a0
> ffff81007df4fca0 ffff81007df4fc68
> Call Trace:<ffffffff801dbe6b>{nlmclnt_recovery+139} <ffffffff801e3d8c>{nlm4svc_proc_sm_notify+188}
> <ffffffff8034c5a4>{svc_process+884} <ffffffff8012ab40>{default_wake_function+0}
> <ffffffff801dde00>{lockd+352} <ffffffff801ddca0>{lockd+0}
> <ffffffff8010e352>{child_rip+8} <ffffffff801ddca0>{lockd+0}
> <ffffffff801ddca0>{lockd+0} <ffffffff8010e34a>{child_rip+0}
>
>
> Code: 48 39 78 18 75 1c 8b 86 8c 00 00 00 a8 01 74 12 83 c8 02 89
> RIP <ffffffff801dbd9e>{nlmclnt_mark_reclaim+62} RSP <ffff81007dfade70>
> CR2: 0000000000000018

Looks like the global lock list is corrupted. Could you cat the contents
of /proc/locks?

> Every machine with a dead lockd has had this oops. Other stuff that
> looks related (these came after the oops, a few days later):

Those errors are unrelated. These errors come from the server.

> lockd: unexpected unlock status: 1
> lockd: weird return 7 for CANCEL call

Error "7" is the equivalent of "ESTALE" (stale filehandle). That means
either someone deleted the file you are trying to lock on the server, or
that a bug caused nfsd to somehow lose track of the file.

I suspect the Error "1" is related to the same issue.


> > Finally, please do
> >
> > echo 1 > /proc/sys/sunrpc/rpc_lockd
> > then unmount one of your NFS partitions, and then mount it again.
>
> That file doesn't exist.
>
> $ ls /proc/sys/sunrpc
> nfs_debug nfsd_debug nlm_debug rpc_debug tcp_slot_table_entries
> udp_slot_table_entries

Sorry, I meant 'nlm_debug'.

Cheers,
Trond

2005-12-18 20:01:08

by Ryan Richter

[permalink] [raw]
Subject: Re: lockd: couldn't create RPC handle for (host)

On Sun, Dec 18, 2005 at 02:31:07PM -0500, Trond Myklebust wrote:
> On Sun, 2005-12-18 at 13:01 -0500, Ryan Richter wrote:
> > Code: 48 39 78 18 75 1c 8b 86 8c 00 00 00 a8 01 74 12 83 c8 02 89
> > RIP <ffffffff801dbd9e>{nlmclnt_mark_reclaim+62} RSP <ffff81007dfade70>
> > CR2: 0000000000000018
>
> Looks like the global lock list is corrupted. Could you cat the contents
> of /proc/locks?

$ cat /proc/locks
1: POSIX ADVISORY WRITE 1657 00:0e:1771273 0 EOF
2: FLOCK ADVISORY WRITE 1486 00:0e:1770759 0 EOF
3: FLOCK ADVISORY WRITE 1478 00:0e:1770399 0 EOF

> > Every machine with a dead lockd has had this oops. Other stuff that
> > looks related (these came after the oops, a few days later):
>
> Those errors are unrelated. These errors come from the server.
>
> > lockd: unexpected unlock status: 1
> > lockd: weird return 7 for CANCEL call
>
> Error "7" is the equivalent of "ESTALE" (stale filehandle). That means
> either someone deleted the file you are trying to lock on the server, or
> that a bug caused nfsd to somehow lose track of the file.
>
> I suspect the Error "1" is related to the same issue.

Unfortunately I don't have a date stamp for these messages. I think
they might be from my initial efforts to debug the problem - I deleted
some lock files once or twice. None of the other machines had these
messages.

>
> > > Finally, please do
> > >
> > > echo 1 > /proc/sys/sunrpc/rpc_lockd
> > > then unmount one of your NFS partitions, and then mount it again.
> >
> > That file doesn't exist.
> >
> > $ ls /proc/sys/sunrpc
> > nfs_debug nfsd_debug nlm_debug rpc_debug tcp_slot_table_entries
> > udp_slot_table_entries
>
> Sorry, I meant 'nlm_debug'.

OK, I did that and unmounted and remounted the home directories.
There's still no lockd process. I had someone using kde log in, and the
login hung for a couple of minutes but then proceeded - it didn't do
that before. There were no kernel messages.

-ryan

2005-12-18 22:24:42

by Trond Myklebust

[permalink] [raw]
Subject: Re: lockd: couldn't create RPC handle for (host)

NLM: Fix Oops in nlmclnt_mark_reclaim()

When mixing -olock and -onolock mounts on the same client, we have to
check that fl->fl_u.nfs_fl.owner is set before dereferencing it.

Signed-off-by: Trond Myklebust <[email protected]>
---

fs/lockd/clntlock.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/lockd/clntlock.c b/fs/lockd/clntlock.c
index 006bb9e..3eaf6e7 100644
--- a/fs/lockd/clntlock.c
+++ b/fs/lockd/clntlock.c
@@ -157,6 +157,8 @@ void nlmclnt_mark_reclaim(struct nlm_hos
inode = fl->fl_file->f_dentry->d_inode;
if (inode->i_sb->s_magic != NFS_SUPER_MAGIC)
continue;
+ if (fl->fl_u.nfs_fl.owner == NULL)
+ continue;
if (fl->fl_u.nfs_fl.owner->host != host)
continue;
if (!(fl->fl_u.nfs_fl.flags & NFS_LCK_GRANTED))
@@ -226,6 +228,8 @@ restart:
inode = fl->fl_file->f_dentry->d_inode;
if (inode->i_sb->s_magic != NFS_SUPER_MAGIC)
continue;
+ if (fl->fl_u.nfs_fl.owner == NULL)
+ continue;
if (fl->fl_u.nfs_fl.owner->host != host)
continue;
if (!(fl->fl_u.nfs_fl.flags & NFS_LCK_RECLAIM))


Attachments:
linux-2.6.15-49-fix_nlm_recovery_oops.dif (1.06 kB)

2005-12-18 22:44:15

by Ryan Richter

[permalink] [raw]
Subject: Re: lockd: couldn't create RPC handle for (host)

On Sun, Dec 18, 2005 at 05:24:24PM -0500, Trond Myklebust wrote:
> On Sun, 2005-12-18 at 15:00 -0500, Ryan Richter wrote:
> > On Sun, Dec 18, 2005 at 02:31:07PM -0500, Trond Myklebust wrote:
> > > On Sun, 2005-12-18 at 13:01 -0500, Ryan Richter wrote:
> > > > Code: 48 39 78 18 75 1c 8b 86 8c 00 00 00 a8 01 74 12 83 c8 02 89
> > > > RIP <ffffffff801dbd9e>{nlmclnt_mark_reclaim+62} RSP <ffff81007dfade70>
> > > > CR2: 0000000000000018
> > >
> > > Looks like the global lock list is corrupted. Could you cat the contents
> > > of /proc/locks?
> >
> > $ cat /proc/locks
> > 1: POSIX ADVISORY WRITE 1657 00:0e:1771273 0 EOF
> > 2: FLOCK ADVISORY WRITE 1486 00:0e:1770759 0 EOF
> > 3: FLOCK ADVISORY WRITE 1478 00:0e:1770399 0 EOF
>
> OK. I think this client patch ought to fix the Oopses. It should apply
> to a 2.6.14 kernel as well as 2.6.15-rc5.

Thanks, I'm testing it now, although I'm not sure how much testing it
will get before the new year.

Cheers,
-ryan

2005-12-19 18:51:16

by Dan Stromberg

[permalink] [raw]
Subject: Re: [NFS] Re: lockd: couldn't create RPC handle for (host)



Sometimes what'll happen is that the daemon (or kernel component) that
services requests on an RPC port will die or otherwise stop servicing
requests, but I believe there's nothing in the portmapper or rpcbind
daemon(s) that will periodically poll to see if those requests are still
being serviced.

I'd hop over to another machine that is advertising the same service,
and use this URL:

http://dcs.nac.uci.edu/~strombrg/What-program-is-active-on-that-port.html

...to see what -should- be handling those requests, and then check on
that same facility on the host having trouble.

On Sat, 2005-12-17 at 00:59 -0500, Ryan Richter wrote:
> On Sat, Dec 17, 2005 at 12:32:57AM -0500, Trond Myklebust wrote:
> > On Fri, 2005-12-16 at 18:58 -0500, Ryan Richter wrote:
> > > On Fri, Dec 16, 2005 at 06:49:05PM -0500, Trond Myklebust wrote:
> > > > On Fri, 2005-12-16 at 15:55 -0500, Ryan Richter wrote:
> > > > > Hi, nfs locking stopped working on my file server running 2.6.15-rc5
> > > > > today. All clients that try locking operations hang, and I get the
> > > > > message from the server:
> > > > >
> > > > > lockd: couldn't create RPC handle for w.x.y.z
> > > >
> > > > Points either to a client which is not responding to callbacks, or an
> > > > OOM situation on the server.
> > > >
> > > > Does 'rpcinfo -u w.x.y.z 100021' work from the server?
> > >
> > > No.
> > >
> > > $ rpcinfo -u jacquere 100021
> > > rpcinfo: RPC: Timed out
> > > program 100021 version 0 is not available
> > > zsh: exit 1 rpcinfo -u jacquere 100021
> > >
> > > So I see now lockd is not present on the client. But...
> > >
> > > $ rpcinfo -p jacquere
> > > program vers proto port
> > > 100000 2 tcp 111 portmapper
> > > 100000 2 udp 111 portmapper
> > > 100021 1 udp 32768 nlockmgr
> > > 100021 3 udp 32768 nlockmgr
> > > 100021 4 udp 32768 nlockmgr
> > > 100024 1 udp 867 status
> > > 100024 1 tcp 870 status
> > >
> > > So what does that mean?
> >
> > Looks to me as if port 111 is pingable (since you can talk to the
> > portmapper), but port 32768 is not. Are you using port filtering or
> > firewalling anywhere (on the client, server, or switches)?
>
> There's no filtering between the two. I get this on the machine itself:
> $ rpcinfo -u localhost 100021
> rpcinfo: RPC: Timed out
> program 100021 version 0 is not available
> zsh: exit 1 rpcinfo -u localhost 100021
>
> There's no lockd process running on this client machine anymore.
>
> On the server:
>
> $ rpcinfo -u localhost 100021
> program 100021 version 1 ready and waiting
> rpcinfo: RPC: Program/version mismatch; low version = 1, high version =4
> program 100021 version 2 is not available
> program 100021 version 3 ready and waiting
> program 100021 version 4 ready and waiting
> zsh: exit 1 rpcinfo -u localhost 100021
>
> Also neither machine is anywhere close to OOM.
>
> -ryan
>
>
> -------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
> for problems? Stop! Download the new AJAX search engine that makes
> searching your log files as easy as surfing the web. DOWNLOAD SPLUNK!
> http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click
> _______________________________________________
> NFS maillist - [email protected]
> https://lists.sourceforge.net/lists/listinfo/nfs
>

2006-01-03 19:25:43

by Ryan Richter

[permalink] [raw]
Subject: Re: lockd: couldn't create RPC handle for (host)

On Sun, Dec 18, 2005 at 05:24:24PM -0500, Trond Myklebust wrote:
> On Sun, 2005-12-18 at 15:00 -0500, Ryan Richter wrote:
> > On Sun, Dec 18, 2005 at 02:31:07PM -0500, Trond Myklebust wrote:
> > > On Sun, 2005-12-18 at 13:01 -0500, Ryan Richter wrote:
> > > > Code: 48 39 78 18 75 1c 8b 86 8c 00 00 00 a8 01 74 12 83 c8 02 89
> > > > RIP <ffffffff801dbd9e>{nlmclnt_mark_reclaim+62} RSP <ffff81007dfade70>
> > > > CR2: 0000000000000018
> > >
> > > Looks like the global lock list is corrupted. Could you cat the contents
> > > of /proc/locks?
> >
> > $ cat /proc/locks
> > 1: POSIX ADVISORY WRITE 1657 00:0e:1771273 0 EOF
> > 2: FLOCK ADVISORY WRITE 1486 00:0e:1770759 0 EOF
> > 3: FLOCK ADVISORY WRITE 1478 00:0e:1770399 0 EOF
>
> OK. I think this client patch ought to fix the Oopses. It should apply
> to a 2.6.14 kernel as well as 2.6.15-rc5.

I applied this patch to 2.6.14.3 and booted it before I left in
December. Now that I'm back, I see that the problem has returned -
lockd is no longer running, and

general protection fault: 0000 [1]
CPU 0
Modules linked in:
Pid: 1174, comm: lockd Not tainted 2.6.14.3 #1
RIP: 0010:[<ffffffff801dbda3>] <ffffffff801dbda3>{nlmclnt_mark_reclaim+67}
RSP: 0018:ffff81007dfade70 EFLAGS: 00010206
RAX: 3168647300000000 RBX: ffff81007ec1cb40 RCX: ffffffff803e4650
RDX: ffff81007f31b0a0 RSI: ffff81007f31b098 RDI: ffff81007ec1cb40
RBP: ffff81000327e200 R08: 00000000fffffffa R09: 0000000000000001
R10: 00000000ffffffff R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffffffff803ec460 R15: ffff81007df5f014
FS: 00002aaaab00c4a0(0000) GS:ffffffff804b6800(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000512000 CR3: 000000007c8d0000 CR4: 00000000000006e0
Process lockd (pid: 1174, threadinfo ffff81007dfac000, task ffff81007eea61c0)
Stack: ffffffff801dbe6b ffff81007ec1cb40 ffffffff801e3d8c 3256cc8445030002
0000000000000000 ffff81007df5bc68 ffff81007df5bc00 ffffffff803ed4e0
ffff81007df5bca0 ffff81007df5bc68
Call Trace:<ffffffff801dbe6b>{nlmclnt_recovery+139} <ffffffff801e3d8c>{nlm4svc_proc_sm_notify+188}
<ffffffff8034c5e4>{svc_process+884} <ffffffff8012ab40>{default_wake_function+0}
<ffffffff801dde00>{lockd+352} <ffffffff801ddca0>{lockd+0}
<ffffffff8010e352>{child_rip+8} <ffffffff801ddca0>{lockd+0}
<ffffffff801ddca0>{lockd+0} <ffffffff8010e34a>{child_rip+0}


Code: 48 39 78 18 75 17 8b 86 8c 00 00 00 a8 01 74 0d 83 c8 02 89
RIP <ffffffff801dbda3>{nlmclnt_mark_reclaim+67} RSP <ffff81007dfade70>

-ryan

2006-01-18 00:46:23

by Ryan Richter

[permalink] [raw]
Subject: Re: lockd: couldn't create RPC handle for (host)

On Sun, Dec 18, 2005 at 05:24:24PM -0500, Trond Myklebust wrote:
> On Sun, 2005-12-18 at 15:00 -0500, Ryan Richter wrote:
> > On Sun, Dec 18, 2005 at 02:31:07PM -0500, Trond Myklebust wrote:
> > > On Sun, 2005-12-18 at 13:01 -0500, Ryan Richter wrote:
> > > > Code: 48 39 78 18 75 1c 8b 86 8c 00 00 00 a8 01 74 12 83 c8 02 89
> > > > RIP <ffffffff801dbd9e>{nlmclnt_mark_reclaim+62} RSP <ffff81007dfade70>
> > > > CR2: 0000000000000018
> > >
> > > Looks like the global lock list is corrupted. Could you cat the contents
> > > of /proc/locks?
> >
> > $ cat /proc/locks
> > 1: POSIX ADVISORY WRITE 1657 00:0e:1771273 0 EOF
> > 2: FLOCK ADVISORY WRITE 1486 00:0e:1770759 0 EOF
> > 3: FLOCK ADVISORY WRITE 1478 00:0e:1770399 0 EOF
>
> OK. I think this client patch ought to fix the Oopses. It should apply
> to a 2.6.14 kernel as well as 2.6.15-rc5.
>
> Cheers,
> Trond
>

> NLM: Fix Oops in nlmclnt_mark_reclaim()

This is still a problem with 2.6.15.

Unable to handle kernel paging request at 000000000000d08c RIP:
<ffffffff801c3538>{nlmclnt_mark_reclaim+67}
PGD 7c4af067 PUD 7d194067 PMD 0
Oops: 0000 [1]
CPU 0
Modules linked in:
Pid: 1222, comm: lockd Not tainted 2.6.15 #5
RIP: 0010:[<ffffffff801c3538>] <ffffffff801c3538>{nlmclnt_mark_reclaim+67}
RSP: 0018:ffff81007ddf9e70 EFLAGS: 00010206
RAX: 000000000000d074 RBX: ffff81007f365240 RCX: ffff81007f973e80
RDX: ffff81007f973e88 RSI: 000000000000003b RDI: ffff81007f365240
RBP: ffff81007ecaec00 R08: 00000000fffffffa R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000010 R15: ffffffff803bb5e0
FS: 00002aaaab00c4a0(0000) GS:ffffffff80489800(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000000d08c CR3: 000000007c4eb000 CR4: 00000000000006e0
Process lockd (pid: 1222, threadinfo ffff81007ddf8000, task ffff81007ed341c0)
Stack: ffffffff801c35ef ffff81007f365240 ffffffff801ca2ba 3256cc8460030002
0000000000000000 0000000000000004 ffff81007f34b000 ffff81007f34b0a0
ffffffff803bc660 ffff81007ddf0014
Call Trace:<ffffffff801c35ef>{nlmclnt_recovery+139} <ffffffff801ca2ba>{nlm4svc_proc_sm_notify+188}
<ffffffff80314b07>{svc_process+871} <ffffffff801c519d>{lockd+344}
<ffffffff801c5045>{lockd+0} <ffffffff801c5045>{lockd+0}
<ffffffff8010dfaa>{child_rip+8} <ffffffff801c5045>{lockd+0}
<ffffffff801c5045>{lockd+0} <ffffffff8010dfa2>{child_rip+0}


Code: 48 39 78 18 75 13 8b 81 8c 00 00 00 a8 01 74 09 83 c8 02 89
RIP <ffffffff801c3538>{nlmclnt_mark_reclaim+67} RSP <ffff81007ddf9e70>
CR2: 000000000000d08c

And another machine

Unable to handle kernel paging request at 0000000000009090 RIP:
<ffffffff801c3538>{nlmclnt_mark_reclaim+67}
PGD 7bc89067 PUD 7bc8a067 PMD 0
Oops: 0000 [1]
CPU 0
Modules linked in:
Pid: 1220, comm: lockd Not tainted 2.6.15 #5
RIP: 0010:[<ffffffff801c3538>] <ffffffff801c3538>{nlmclnt_mark_reclaim+67}
RSP: 0018:ffff81007de17e70 EFLAGS: 00010206
RAX: 0000000000009078 RBX: ffff810079e35f00 RCX: ffff81007f979e80
RDX: ffff81007f979e88 RSI: 000000000000003b RDI: ffff810079e35f00
RBP: ffff81007eca9200 R08: 00000000fffffffa R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000010 R15: ffffffff803bb5e0
FS: 00002aaaab00c4a0(0000) GS:ffffffff80489800(0000) knlGS:0000000056701920
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000009090 CR3: 000000007bc7e000 CR4: 00000000000006e0
Process lockd (pid: 1220, threadinfo ffff81007de16000, task ffff81007ed261c0)
Stack: ffffffff801c35ef ffff810079e35f00 ffffffff801ca2ba 3256cc8473030002
0000000000000000 0000000000000004 ffff81007f0e3000 ffff81007f0e30a0
ffffffff803bc660 ffff81007ddf8014
Call Trace:<ffffffff801c35ef>{nlmclnt_recovery+139} <ffffffff801ca2ba>{nlm4svc_proc_sm_notify+188}
<ffffffff80314b07>{svc_process+871} <ffffffff801c519d>{lockd+344}
<ffffffff801c5045>{lockd+0} <ffffffff801c5045>{lockd+0}
<ffffffff8010dfaa>{child_rip+8} <ffffffff801c5045>{lockd+0}
<ffffffff801c5045>{lockd+0} <ffffffff8010dfa2>{child_rip+0}


Code: 48 39 78 18 75 13 8b 81 8c 00 00 00 a8 01 74 09 83 c8 02 89
RIP <ffffffff801c3538>{nlmclnt_mark_reclaim+67} RSP <ffff81007de17e70>
CR2: 0000000000009090

Lockd is no longer running etc.

-ryan

2006-02-27 22:02:22

by Ryan Richter

[permalink] [raw]
Subject: Re: lockd: couldn't create RPC handle for (host)

On Sun, Dec 18, 2005 at 05:24:24PM -0500, Trond Myklebust wrote:
> OK. I think this client patch ought to fix the Oopses. It should apply
> to a 2.6.14 kernel as well as 2.6.15-rc5.

Oops still occurs with 2.6.15.4.

Unable to handle kernel paging request at 0000000100000011 RIP:
<ffffffff801c3424>{nlmclnt_mark_reclaim+67}
PGD 7bc9b067 PUD 0
Oops: 0000 [1]
CPU 0
Modules linked in:
Pid: 2210, comm: lockd Not tainted 2.6.15.4 #3
RIP: 0010:[<ffffffff801c3424>] <ffffffff801c3424>{nlmclnt_mark_reclaim+67}
RSP: 0018:ffff81007ddf9e70 EFLAGS: 00010206
RAX: 00000000fffffff9 RBX: ffff81007f35a540 RCX: ffff81007ecb5980
RDX: ffff81007ecb5988 RSI: 0000000000000053 RDI: ffff81007f35a540
RBP: ffff81007f8e1800 R08: 00000000fffffffa R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000010 R15: ffffffff803bb460
FS: 00002aaaab00c4a0(0000) GS:ffffffff80489800(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000100000011 CR3: 000000007bc9a000 CR4: 00000000000006e0
Process lockd (pid: 2210, threadinfo ffff81007ddf8000, task ffff81007ed261c0)
Stack: ffffffff801c34db ffff81007f35a540 ffffffff801ca1a6 3256cc8459020002
0000000000000000 0000000000000004 ffff81007ec94000 ffff81007ec940a0
ffffffff803bc4e0 ffff810061626014
Call Trace:<ffffffff801c34db>{nlmclnt_recovery+139} <ffffffff801ca1a6>{nlm4svc_proc_sm_notify+188}
<ffffffff803149d7>{svc_process+871} <ffffffff801c5089>{lockd+344}
<ffffffff801c4f31>{lockd+0} <ffffffff801c4f31>{lockd+0}
<ffffffff8010dfaa>{child_rip+8} <ffffffff801c4f31>{lockd+0}
<ffffffff801c4f31>{lockd+0} <ffffffff8010dfa2>{child_rip+0}


Code: 48 39 78 18 75 13 8b 81 8c 00 00 00 a8 01 74 09 83 c8 02 89
RIP <ffffffff801c3424>{nlmclnt_mark_reclaim+67} RSP <ffff81007ddf9e70>
CR2: 0000000100000011

-ryan