2015-11-30 16:30:17

by Peter Thurner

Subject: NFS Kernel Panics

Hi guys,

I'm running the following setup on Ubuntu 14.04 on both the server and the clients:


== NFS Server with /etc/exports:

/var/www/ 172.16.1.254(rw,no_root_squash,sync,no_subtree_check) 172.16.1.184(rw,no_root_squash,sync,no_subtree_check) 172.16.0.120(rw,no_root_squash,sync,no_subtree_check) 172.16.0.193(rw,no_root_squash,sync,no_subtree_check)

Version: 1:1.2.8-6ubuntu1.2


== Four NFS Clients with fstab:

alpha:/var/www /var/www nfs4 nosharecache,fsc=example_web,noatime,tcp,bg,nosuid,rsize=32768,wsize=32768,soft,proto=tcp 0 0

On the clients I'm using cachefilesd:

/var/cache/cachefilesd/loopimage.img /var/cache/cachefilesd/srv ext4 loop,rw,relatime,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0

root@web1:~# cat /etc/cachefilesd.conf
dir /var/cache/cachefilesd/srv
tag nfs_filesystem_cache
brun 20%
frun 10%
bcull 10%
fcull 7%
bstop 5%
fstop 3%
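
For reference, cachefilesd expects the culling limits to obey 0 <= bstop < bcull < brun < 100, and the same ordering for fstop, fcull and frun. A minimal Python sketch that checks the values above against that rule; the helper name `check_cull_limits` is hypothetical and not part of cachefilesd:

```python
import re

# The cachefilesd.conf shown above, copied verbatim.
CONF = """\
dir /var/cache/cachefilesd/srv
tag nfs_filesystem_cache
brun 20%
frun 10%
bcull 10%
fcull 7%
bstop 5%
fstop 3%
"""


def check_cull_limits(text):
    """Return True if block and file limits obey 0 <= stop < cull < run < 100."""
    limits = {}
    for line in text.splitlines():
        # Match lines such as "brun 20%" or "fstop 3%".
        m = re.fullmatch(r"\s*([bf](?:run|cull|stop))\s+(\d+)%\s*", line)
        if m:
            limits[m.group(1)] = int(m.group(2))
    return all(
        0 <= limits[p + "stop"] < limits[p + "cull"] < limits[p + "run"] < 100
        for p in "bf"
    )


print(check_cull_limits(CONF))
```

For the configuration above this prints `True`, so the limits themselves are ordered correctly.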


== Problem

Both server and clients experience random kernel panics. Of the five
machines, around one dies per day. They all run on Amazon AWS as
m4.large instances. When I set

rpcdebug -m nfsd -s all
rpcdebug -m rpc -s all

the messages logged just before the crash (this time on the NFS server) are:

```
Nov 30 13:49:54 nfs-master kernel: [38232.649545] nfsd_dispatch: vers 4 proc 1
Nov 30 13:49:54 nfs-master kernel: [38232.649547] nfsv4 compound op #1/3: 22 (OP_PUTFH)
Nov 30 13:49:54 nfs-master kernel: [38232.649548] nfsd: fh_verify(32: 81060001 0c7791ab ab46dd87 663ae28a 6877949f 2802898e)
Nov 30 13:49:54 nfs-master kernel: [38232.649552] nfsv4 compound op ffff8802026c8080 opcnt 3 #1: 22: status 0
Nov 30 13:49:54 nfs-master kernel: [38232.649553] nfsv4 compound op #2/3: 4 (OP_CLOSE)
Nov 30 13:49:54 nfs-master kernel: [38232.649554] NFSD: nfsd4_close on file objectLinksShadow.png
Nov 30 13:49:54 nfs-master kernel: [38232.649556] NFSD: nfs4_preprocess_seqid_op: seqid=818421 stateid = (565bb0a0/00000001/00083f05/00000001)
Nov 30 13:49:54 nfs-master kernel: [38232.649557] renewing client (clientid 565bb0a0/00000001)
Nov 30 13:49:54 nfs-master kernel: [38232.649558] NFSD: move_to_close_lru nfs4_openowner ffff8800373b8000
Nov 30 13:49:54 nfs-master kernel: [38232.649559] nfsv4 compound op ffff8802026c8080 opcnt 3 #2: 4: status 0
Nov 30 13:49:54 nfs-master kernel: [38232.649560] nfsv4 compound op #3/3: 9 (OP_GETATTR)
Nov 30 13:49:54 nfs-master kernel: [38232.649562] nfsd: fh_verify(32: 81060001 0c7791ab ab46dd87 663ae28a 6877949f 2802898e)
Nov 30 13:49:54 nfs-master kernel: [38232.649564] nfsv4 compound op ffff8802026c8080 opcnt 3 #3: 9: status 0
Nov 30 13:49:54 nfs-master kernel: [38232.649565] nfsv4 compound returned 0
Nov 30 13:49:54 nfs-master kernel: [38232.649570] svc: socket ffff8800e929d000 sendto([ffff8801e07ae000 136... ], 136) = 136 (addr 172.16.0.120, port=958)
Nov 30 13:49:54 nfs-master kernel: [38232.649571] svc: server ffff880202142000 waiting for data (to = 900000)
Nov 30 13:49:54 nfs-master rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="939" x-info="http://www.rsyslog.com"] exiting on signal 15.

Server is rebooting here

Nov 30 13:50:34 nfs-master rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="951" x-info="http://www.rsyslog.com"] start
Nov 30 13:50:34 nfs-master rsyslogd-2307: warning: ~ action is deprecated, consider using the 'stop' statement instead [try http://www.rsyslog.com/e/2307 ]
Nov 30 13:50:34 nfs-master rsyslogd: rsyslogd's groupid changed to 104
Nov 30 13:50:34 nfs-master rsyslogd: rsyslogd's userid changed to 101
Nov 30 13:50:34 nfs-master kernel: [ 0.000000] Initializing cgroup subsys cpuset
Nov 30 13:50:34 nfs-master kernel: [ 0.000000] Initializing cgroup subsys cpu
Nov 30 13:50:34 nfs-master kernel: [ 0.000000] Initializing cgroup subsys cpuacct

```
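
The trace above records one NFSv4 COMPOUND request with three operations, each followed by its status line. A rough Python sketch of how such rpcdebug output could be condensed into one row per operation; the regular expressions are assumptions fitted to the lines in this log only, and `summarize` is a made-up helper name:

```python
import re

# Sample lines copied from the log above, with the mail client's wrapping undone.
LOG = """\
Nov 30 13:49:54 nfs-master kernel: [38232.649547] nfsv4 compound op #1/3: 22 (OP_PUTFH)
Nov 30 13:49:54 nfs-master kernel: [38232.649552] nfsv4 compound op ffff8802026c8080 opcnt 3 #1: 22: status 0
Nov 30 13:49:54 nfs-master kernel: [38232.649553] nfsv4 compound op #2/3: 4 (OP_CLOSE)
Nov 30 13:49:54 nfs-master kernel: [38232.649559] nfsv4 compound op ffff8802026c8080 opcnt 3 #2: 4: status 0
Nov 30 13:49:54 nfs-master kernel: [38232.649560] nfsv4 compound op #3/3: 9 (OP_GETATTR)
Nov 30 13:49:54 nfs-master kernel: [38232.649564] nfsv4 compound op ffff8802026c8080 opcnt 3 #3: 9: status 0
"""

# "#1/3: 22 (OP_PUTFH)" announces an operation and gives its name.
OP_RE = re.compile(r"nfsv4 compound op #(\d+)/(\d+): (\d+) \((OP_\w+)\)")
# "... opcnt 3 #1: 22: status 0" reports the result of that operation.
STATUS_RE = re.compile(r"nfsv4 compound op \S+ opcnt (\d+) #(\d+): (\d+): status (\d+)")


def summarize(log):
    """Return a list of (op index, op name, status) tuples found in the log."""
    names = {}
    results = []
    for line in log.splitlines():
        m = OP_RE.search(line)
        if m:
            names[int(m.group(1))] = m.group(4)
            continue
        m = STATUS_RE.search(line)
        if m:
            idx = int(m.group(2))
            results.append((idx, names.get(idx, "?"), int(m.group(4))))
    return results


for idx, name, status in summarize(LOG):
    print(idx, name, status)
```

On the sample lines this lists OP_PUTFH, OP_CLOSE and OP_GETATTR, each with status 0.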


2015-11-30 23:07:06

by J. Bruce Fields

Subject: Re: NFS Kernel Panics

On Mon, Nov 30, 2015 at 05:20:23PM +0100, Peter Thurner wrote:
> Hi guys,
>
> I'm running the following Setup on Ubuntu 14.04 for both Server and Clients:

I don't know what kernel version that translates to.

Ideally this would either get reported to Ubuntu, or reproduced with an
upstream kernel before getting reported here.
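
The running kernel release can be read directly on each affected machine. A minimal Python sketch, with `kernel_release` as a made-up helper name; on Linux, `platform.release()` returns the same string as `uname -r`:

```python
import platform


def kernel_release():
    """Return the kernel release string of the machine this runs on."""
    return platform.release()


# Print the OS name and kernel release, e.g. for inclusion in a bug report.
print(platform.system(), kernel_release())
```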

>
>
> == NFS Server with /etc/exports:
>
> /var/www/ 172.16.1.254(rw,no_root_squash,sync,no_subtree_check)
> 172.16.1.184(rw,no_root_squash,sync,no_subtree_check)
> 172.16.0.120(rw,no_root_squash,sync,no_subtree_check)
> 172.16.0.193(rw,no_root_squash,sync,no_subtree_check)
>
> Version: 1:1.2.8-6ubuntu1.2
>
>
> == Four NFS Clients with fstab:
>
> alpha:/var/www /var/www nfs4
> nosharecache,fsc=example_web,noatime,tcp,bg,nosuid,rsize=32768,wsize=32768,soft,proto=tcp
> 0 0
>
> On the Clients i'm using cachefilesd:
>
> /var/cache/cachefilesd/loopimage.img
> /var/cache/cachefilesd/srv ext4
> loop,rw,relatime,errors=continue,user_xattr,acl,barrier=1,data=ordered 0
> 0
>
> root@web1:~# cat /etc/cachefilesd.conf
> dir /var/cache/cachefilesd/srv
> tag nfs_filesystem_cache
> brun 20%
> frun 10%
> bcull 10%
> fcull 7%
> bstop 5%
> fstop 3%
>
>
> == Problem
>
> Both server and clients experience random kernel Panics. Of the five
> machines, around one dies per die.

per day?

> They all run on Amazon AWS as
> m4.large instances. When I set
>
> rpcdebug -m nfsd -s all
> rpcdebug -m rpc -s all
>
> The messages before the crash (this time on the NFS server) are:
>
> ```
> Nov 30 13:49:54 nfs-master kernel: [38232.649545] nfsd_dispatch: vers 4
> proc 1
> Nov 30 13:49:54 nfs-master kernel: [38232.649547] nfsv4 compound op
> #1/3: 22 (OP_PUTFH)
> Nov 30 13:49:54 nfs-master kernel: [38232.649548] nfsd: fh_verify(32:
> 81060001 0c7791ab ab46dd87 663ae28a 6877949f 2802898e)
> Nov 30 13:49:54 nfs-master kernel: [38232.649552] nfsv4 compound op
> ffff8802026c8080 opcnt 3 #1: 22: status 0
> Nov 30 13:49:54 nfs-master kernel: [38232.649553] nfsv4 compound op
> #2/3: 4 (OP_CLOSE)
> Nov 30 13:49:54 nfs-master kernel: [38232.649554] NFSD: nfsd4_close on
> file objectLinksShadow.png
> Nov 30 13:49:54 nfs-master kernel: [38232.649556] NFSD:
> nfs4_preprocess_seqid_op: seqid=818421 stateid =
> (565bb0a0/00000001/00083f05/00000001)
> Nov 30 13:49:54 nfs-master kernel: [38232.649557] renewing client
> (clientid 565bb0a0/00000001)
> Nov 30 13:49:54 nfs-master kernel: [38232.649558] NFSD:
> move_to_close_lru nfs4_openowner ffff8800373b8000
> Nov 30 13:49:54 nfs-master kernel: [38232.649559] nfsv4 compound op
> ffff8802026c8080 opcnt 3 #2: 4: status 0
> Nov 30 13:49:54 nfs-master kernel: [38232.649560] nfsv4 compound op
> #3/3: 9 (OP_GETATTR)
> Nov 30 13:49:54 nfs-master kernel: [38232.649562] nfsd: fh_verify(32:
> 81060001 0c7791ab ab46dd87 663ae28a 6877949f 2802898e)
> Nov 30 13:49:54 nfs-master kernel: [38232.649564] nfsv4 compound op
> ffff8802026c8080 opcnt 3 #3: 9: status 0
> Nov 30 13:49:54 nfs-master kernel: [38232.649565] nfsv4 compound returned 0
> Nov 30 13:49:54 nfs-master kernel: [38232.649570] svc: socket
> ffff8800e929d000 sendto([ffff8801e07ae000 136... ], 136) = 136 (addr
> 172.16.0.120, port=958)
> Nov 30 13:49:54 nfs-master kernel: [38232.649571] svc: server
> ffff880202142000 waiting for data (to = 900000)

This all looks pretty normal to me.

> Nov 30 13:49:54 nfs-master rsyslogd: [origin software="rsyslogd"
> swVersion="7.4.4" x-pid="939" x-info="http://www.rsyslog.com"] exiting
> on signal 15.

That's SIGTERM. No idea if that means anything.
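
The number-to-name mapping can be confirmed with Python's standard `signal` module. A small sketch, where `signal_name` is a hypothetical helper:

```python
import signal


def signal_name(number):
    """Return the symbolic name for a signal number on this platform."""
    return signal.Signals(number).name


# rsyslogd reported "exiting on signal 15" in the log quoted above.
print(signal_name(15))
```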

Sorry, I don't see anything much to go on here. Is there a console that
might have anything more? I'm not very familiar with AWS.

--b.

> Server is rebooting here
>
> Nov 30 13:50:34 nfs-master rsyslogd: [origin software="rsyslogd"
> swVersion="7.4.4" x-pid="951" x-info="http://www.rsyslog.com"] start
> Nov 30 13:50:34 nfs-master rsyslogd-2307: warning: ~ action is
> deprecated, consider using the 'stop' statement instead [try
> http://www.rsyslog.com/e/2307 ]
> Nov 30 13:50:34 nfs-master rsyslogd: rsyslogd's groupid changed to 104
> Nov 30 13:50:34 nfs-master rsyslogd: rsyslogd's userid changed to 101
> Nov 30 13:50:34 nfs-master kernel: [ 0.000000] Initializing cgroup
> subsys cpuset
> Nov 30 13:50:34 nfs-master kernel: [ 0.000000] Initializing cgroup
> subsys cpu
> Nov 30 13:50:34 nfs-master kernel: [ 0.000000] Initializing cgroup
> subsys cpuacct
>
> ```