2007-10-15 17:43:27

by Andrea Righi

Subject: nfsd closes port 2049


Attachments:
config (56.93 kB)

2007-10-15 18:05:01

by Talpey, Thomas

Subject: Re: nfsd closes port 2049

>Oct 13 05:20:56 node0101 kernel: nfsd_acceptable failed at ffff8100c7873700

Sounds like the filesystem became unexported, or unexportable
due to turning off an "x" bit somewhere along the directory tree.
Were all these clients accessing a single mountpoint? Check
/etc/exports, and that directory.
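
For example, something along these lines shows the mode bits of every
component of the exported path (the path here is hypothetical, so
substitute the actual export point):

namei -m /path/to/export

or simply:

ls -ld / /path /path/to /path/to/export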

Tom.




At 12:57 PM 10/15/2007, Andrea Righi wrote:
>Hi all,
>
>I'm trying to debug a weird problem with nfsd on a 2.6.16.27-0.6-smp
>kernel.
>
>1 server: SuSE SLES 10 x86_64, config attached
>256 clients: RHEL4 Update 4 2.6.9-42.ELsmp x86_64
>
>Using nfs v3.
>
>The clients have been happily talking to the server for several days
>without incident.
>
>The weird thing is that at a certain point the socket opened on port
>2049 on the NFS server gets closed for unknown reasons (or rather,
>reasons unknown to me!). By "unknown reasons" I mean that I don't see
>any critical error message in the logs, even with debug verbosity
>enabled. I've enabled the maximum debug verbosity with:
>
>echo 2147483647 > /proc/sys/sunrpc/nfsd_debug
>
>The failures in the accept()s confirm that the server socket was working
>fine and was then suddenly closed (I made a few attempts with a simple
>netcat from localhost to check the socket's availability):
>
># bzcat /var/log/messages-20071013.bz2 | grep accept
>Oct 13 00:30:05 node0101 kernel: svc: tcp_accept ffff81015ea1d380 sock
>ffff81015ce1a780
>Oct 13 00:30:05 node0101 kernel: svc: tcp_accept ffff8100c7997c00 allocated
>Oct 13 00:30:06 node0101 kernel: svc: tcp_accept ffff81015ea1d380 sock
>ffff81015ce1a780
>Oct 13 00:30:06 node0101 kernel: svc: tcp_accept ffff8100c7997c00 allocated
>Oct 13 00:32:30 node0101 kernel: svc: tcp_accept ffff81015ea1d380 sock
>ffff81015ce1a780
>Oct 13 00:32:30 node0101 kernel: svc: tcp_accept ffff8100c7997c00 allocated
>Oct 13 00:32:31 node0101 kernel: svc: tcp_accept ffff81015ea1d380 sock
>ffff81015ce1a780
>Oct 13 00:32:31 node0101 kernel: svc: tcp_accept ffff8100c7997c00 allocated
>Oct 13 05:20:56 node0101 kernel: nfsd_acceptable failed at ffff8100c7873700
>Oct 13 05:51:06 node0101 kernel: nfsd_acceptable failed at ffff8100cba472f0
>Oct 13 09:51:11 node0101 kernel: nfsd_acceptable failed at ffff8100c6dbba40
>Oct 13 11:04:33 node0101 kernel: nfsd_acceptable failed at ffff8100ce485a40
>Oct 13 11:51:30 node0101 kernel: nfsd_acceptable failed at ffff8100c95552f0
>
>The other strange thing is that the server receives a lot of
>connection-close requests from the clients, for example:
>
>node0101:~ # bzcat /var/log/messages-20071013.bz2 | grep "close 1$"
>...
>Oct 12 18:07:15 node0101 kernel: svc: tcp_recv ffff810114f16dc0 data 1
>conn 0 close 1
>Oct 12 18:07:54 node0101 kernel: svc: tcp_recv ffff810114e07c80 data 0
>conn 0 close 1
>Oct 12 18:10:58 node0101 kernel: svc: tcp_recv ffff810114fcc880 data 1
>conn 0 close 1
>Oct 12 18:11:54 node0101 kernel: svc: tcp_recv ffff81010e940a80 data 0
>conn 0 close 1
>Oct 12 18:13:40 node0101 kernel: svc: tcp_recv ffff81013e4cd6c0 data 1
>conn 0 close 1
>Oct 12 18:13:45 node0101 kernel: svc: tcp_recv ffff810111acd8c0 data 1
>conn 0 close 1
>Oct 12 18:15:25 node0101 kernel: svc: tcp_recv ffff810112431c80 data 0
>conn 0 close 1
>Oct 12 18:16:00 node0101 kernel: svc: tcp_recv ffff810114f58980 data 1
>conn 0 close 1
>Oct 12 18:16:17 node0101 kernel: svc: tcp_recv ffff810114f58180 data 1
>conn 0 close 1
>Oct 12 18:16:27 node0101 kernel: svc: tcp_recv ffff81011130cc80 data 1
>conn 0 close 1
>Oct 12 18:16:37 node0101 kernel: svc: tcp_recv ffff81011130ca80 data 1
>conn 0 close 1
>Oct 12 18:17:03 node0101 kernel: svc: tcp_recv ffff81011130c880 data 1
>conn 0 close 1
>Oct 12 18:20:18 node0101 kernel: svc: tcp_recv ffff8100d67be9c0 data 1
>conn 0 close 1
>Oct 12 18:22:52 node0101 kernel: svc: tcp_recv ffff810111a23bc0 data 0
>conn 0 close 1
>...
>
>Is this expected behaviour or a potential symptom of a problem? What
>information should I be looking for in the logs?
>
>Any help appreciated.
>
>Thanks,
>-Andrea


2007-10-15 18:27:38

by Andrea Righi

Subject: Re: nfsd closes port 2049

Talpey, Thomas wrote:
>> Oct 13 05:20:56 node0101 kernel: nfsd_acceptable failed at ffff8100c7873700
>
> Sounds like the filesystem became unexported, or unexportable
> due to turning off an "x" bit somewhere along the directory tree.
> Were all these clients accessing a single mountpoint? Check
> /etc/exports, and that directory.

Thomas,

thanks for the quick reply. Here is the /etc/exports (all clients are
accessing the same mountpoint):

node0101:~ # cat /etc/exports
# See the exports(5) manpage for a description of the syntax of this file.
# This file contains a list of all directories that are to be exported to
# other computers via NFS (Network File System).
# This file used by rpc.nfsd and rpc.mountd. See their manpages for details
# on how make changes in this file effective.

/eni01 *.eni01.cineca.it(rw,no_root_squash,async,fsid=745)

And:

node0101:~ # exportfs -v
/eni01 *.eni01.cineca.it(rw,async,wdelay,no_root_squash,fsid=745)

The exported directory is still available while the fault is occurring.
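
(Next time it happens I can also verify from a client that the export
is still visible, e.g. with "showmount -e node0101" or
"rpcinfo -p node0101", to help rule out a mountd/portmap problem.)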

It's a GPFS mountpoint exported to the clients via NFS (I don't think
GPFS is the issue; I've used the same configuration in a lot of similar
setups without any problem).

node0101:~ # mount
/dev/mapper/root_vg-root_lv on / type ext3 (rw,acl,user_xattr)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
debugfs on /sys/kernel/debug type debugfs (rw)
udev on /dev type tmpfs (rw)
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
/dev/sda1 on /boot type ext2 (rw,acl,user_xattr)
/dev/mapper/root_vg-home_lv on /home type ext3 (rw,acl,user_xattr)
/dev/mapper/root_vg-tmp_lv on /tmp type ext3 (rw,acl,user_xattr)
/dev/mapper/root_vg-usr_lv on /usr type ext3 (rw,acl,user_xattr)
/dev/mapper/root_vg-var_lv on /var type ext3 (rw,acl,user_xattr)
/dev/gpfs_eni01 on /eni01 type gpfs (rw,mtime,quota=userquota;groupquota;filesetquota,dev=gpfs_eni01,autostart)
nfsd on /proc/fs/nfsd type nfsd (rw)
node0101:~ # df -hT /eni01
Filesystem Type Size Used Avail Use% Mounted on
/dev/gpfs_eni01
gpfs 18T 292G 18T 2% /eni01
node0101:~ # stat /eni01/
File: `/eni01/'
Size: 32768 Blocks: 64 IO Block: 32768 directory
Device: 11h/17d Inode: 3 Links: 17
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2007-10-15 20:08:21.000000000 +0200
Modify: 2007-10-15 15:43:08.846245519 +0200
Change: 2007-10-15 15:43:08.846245519 +0200

BTW, I see some dropped packets on the network interfaces used to export
the filesystem (bond0 and its slaves):

node0101:~ # ifconfig bond0
bond0 Link encap:Ethernet HWaddr 00:15:17:23:F1:29
inet addr:10.130.0.11 Bcast:10.130.255.255 Mask:255.255.0.0
inet6 addr: fe80::215:17ff:fe23:f129/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:9000 Metric:1
RX packets:594282923 errors:0 dropped:2253 overruns:0 frame:0
TX packets:549363611 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:149828096309 (142887.2 Mb) TX bytes:199133526153 (189908.5 Mb)

node0101:~ # ifconfig eth3
eth3 Link encap:Ethernet HWaddr 00:15:17:23:F1:29
inet6 addr: fe80::215:17ff:fe23:f129/64 Scope:Link
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9000 Metric:1
RX packets:320861616 errors:0 dropped:1384 overruns:0 frame:0
TX packets:265916198 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:103186260542 (98406.0 Mb) TX bytes:98671309326 (94100.2 Mb)
Base address:0x7420 Memory:e7960000-e7980000

node0101:~ # ifconfig eth5
eth5 Link encap:Ethernet HWaddr 00:15:17:23:F1:29
inet6 addr: fe80::215:17ff:fe23:f129/64 Scope:Link
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9000 Metric:1
RX packets:273428614 errors:0 dropped:869 overruns:0 frame:0
TX packets:283454604 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:46645599519 (44484.7 Mb) TX bytes:100463595893 (95809.5 Mb)
Base address:0x5420 Memory:e7e60000-e7e80000

Could that lead to NFS problems (even if it sounds rather unlikely)?
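
(If it's useful I can sample the counters while the problem is
happening, to see whether the drops keep growing, with something as
simple as:

while true; do date; ifconfig bond0 | grep -A1 "RX packets"; sleep 10; done

and "ethtool -S" on the slave interfaces might show which driver
counter is incrementing, although the statistic names vary by driver.)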

-Andrea


2007-10-15 18:40:55

by Talpey, Thomas

Subject: Re: nfsd closes port 2049

I'm snipping LKML out of the cc; this isn't really a kernel issue.

I can give two suggestions. One, look carefully at whether GPFS had
a momentary issue at the time of any error. The server permission check
would consult it, and any error will cause the export to be denied.
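
(Assuming the default GPFS log location, which may differ on your
install, the daemon log around those timestamps would be the place to
look, e.g.:

grep -iE "error|expel|unmount" /var/adm/ras/mmfs.log.latest

or just read the log around 05:20 on Oct 13.)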

Second, you can add "no_subtree_check" to the export line, which
will bypass this checking. Since you appear to be exporting the root
of the filesystem, it seems there is little need for this checking. In
newer versions of nfs-utils, not checking is in fact the default. You
can read about the option with "man exports".
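
For example, based on the export line you posted, something like:

/eni01 *.eni01.cineca.it(rw,no_root_squash,async,no_subtree_check,fsid=745)

followed by "exportfs -ra" to re-export without restarting nfsd.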

Tom.

At 02:23 PM 10/15/2007, Andrea Righi wrote:
>Talpey, Thomas wrote:
>>> Oct 13 05:20:56 node0101 kernel: nfsd_acceptable failed at ffff8100c7873700
>>
>> Sounds like the filesystem became unexported, or unexportable
>> due to turning off an "x" bit somewhere along the directory tree.
>> Were all these clients accessing a single mountpoint? Check
>> /etc/exports, and that directory.
>
>Thomas,
>
>thanks for the quick reply. Here is the /etc/exports (all clients are
>accessing the same mountpoint):
>
>node0101:~ # cat /etc/exports
># See the exports(5) manpage for a description of the syntax of this file.
># This file contains a list of all directories that are to be exported to
># other computers via NFS (Network File System).
># This file used by rpc.nfsd and rpc.mountd. See their manpages for details
># on how make changes in this file effective.
>
>/eni01 *.eni01.cineca.it(rw,no_root_squash,async,fsid=745)
>
>And:
>
>node0101:~ # exportfs -v
>/eni01 *.eni01.cineca.it(rw,async,wdelay,no_root_squash,fsid=745)
>
>The exported directory is still available while the fault is occurring.
>
>It's a GPFS mountpoint exported to the clients via NFS (I don't think
>GPFS is the issue; I've used the same configuration in a lot of similar
>setups without any problem).
>
>node0101:~ # mount
>/dev/mapper/root_vg-root_lv on / type ext3 (rw,acl,user_xattr)
>proc on /proc type proc (rw)
>sysfs on /sys type sysfs (rw)
>debugfs on /sys/kernel/debug type debugfs (rw)
>udev on /dev type tmpfs (rw)
>devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
>/dev/sda1 on /boot type ext2 (rw,acl,user_xattr)
>/dev/mapper/root_vg-home_lv on /home type ext3 (rw,acl,user_xattr)
>/dev/mapper/root_vg-tmp_lv on /tmp type ext3 (rw,acl,user_xattr)
>/dev/mapper/root_vg-usr_lv on /usr type ext3 (rw,acl,user_xattr)
>/dev/mapper/root_vg-var_lv on /var type ext3 (rw,acl,user_xattr)
>/dev/gpfs_eni01 on /eni01 type gpfs
>(rw,mtime,quota=userquota;groupquota;filesetquota,dev=gpfs_eni01,autostart)
>nfsd on /proc/fs/nfsd type nfsd (rw)
>node0101:~ # df -hT /eni01
>Filesystem Type Size Used Avail Use% Mounted on
>/dev/gpfs_eni01
> gpfs 18T 292G 18T 2% /eni01
>node0101:~ # stat /eni01/
> File: `/eni01/'
> Size: 32768 Blocks: 64 IO Block: 32768 directory
>Device: 11h/17d Inode: 3 Links: 17
>Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
>Access: 2007-10-15 20:08:21.000000000 +0200
>Modify: 2007-10-15 15:43:08.846245519 +0200
>Change: 2007-10-15 15:43:08.846245519 +0200
>
>BTW, I see some dropped packets on the network interfaces used to export
>the filesystem (bond0 and its slaves):
>
>node0101:~ # ifconfig bond0
>bond0 Link encap:Ethernet HWaddr 00:15:17:23:F1:29
> inet addr:10.130.0.11 Bcast:10.130.255.255 Mask:255.255.0.0
> inet6 addr: fe80::215:17ff:fe23:f129/64 Scope:Link
> UP BROADCAST RUNNING MASTER MULTICAST MTU:9000 Metric:1
> RX packets:594282923 errors:0 dropped:2253 overruns:0 frame:0
> TX packets:549363611 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:149828096309 (142887.2 Mb) TX bytes:199133526153
>(189908.5 Mb)
>
>node0101:~ # ifconfig eth3
>eth3 Link encap:Ethernet HWaddr 00:15:17:23:F1:29
> inet6 addr: fe80::215:17ff:fe23:f129/64 Scope:Link
> UP BROADCAST RUNNING SLAVE MULTICAST MTU:9000 Metric:1
> RX packets:320861616 errors:0 dropped:1384 overruns:0 frame:0
> TX packets:265916198 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:103186260542 (98406.0 Mb) TX bytes:98671309326 (94100.2 Mb)
> Base address:0x7420 Memory:e7960000-e7980000
>
>node0101:~ # ifconfig eth5
>eth5 Link encap:Ethernet HWaddr 00:15:17:23:F1:29
> inet6 addr: fe80::215:17ff:fe23:f129/64 Scope:Link
> UP BROADCAST RUNNING SLAVE MULTICAST MTU:9000 Metric:1
> RX packets:273428614 errors:0 dropped:869 overruns:0 frame:0
> TX packets:283454604 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:46645599519 (44484.7 Mb) TX bytes:100463595893 (95809.5 Mb)
> Base address:0x5420 Memory:e7e60000-e7e80000
>
>Could that lead to NFS problems (even if it sounds rather unlikely)?
>
>-Andrea


2007-10-15 19:43:27

by Andrea Righi

Subject: Re: nfsd closes port 2049

Talpey, Thomas wrote:
> I'm snipping LKML out of the cc; this isn't really a kernel issue.
>
> I can give two suggestions. One, look carefully at whether GPFS had
> a momentary issue at the time of any error. The server permission check
> would consult it, and any error will cause the export to be denied.
>
> Second, you can add "no_subtree_check" to the export line, which
> will bypass this checking. Since you appear to be exporting the root
> of the filesystem, it seems there is little need for this checking. In
> newer versions of nfs-utils, not checking is in fact the default. You
> can read about the option with "man exports".

No errors were logged by GPFS during the latest failure, but I'll try
no_subtree_check and keep you informed.

Thanks,
-Andrea
