Hi all, I'm sure you get this a lot, but I couldn't find a solution. We have a client/server pair, each with both 1Gb and 10Gb network interfaces. I can mount the share on the client over the 1Gb interface just fine and interact with it normally. If I unmount and then mount the share over the 10Gb interface, the mount succeeds but everything after that hangs (ls, df, and so on). The exports entries on the server are the same for both networks:
#1Gb interface
/data 10.10.10.0/24(rw,no_root_squash,async)
#10Gb interface
/data 128.196.X.X/28(rw,no_root_squash,async)
I turned off iptables for troubleshooting and checked with the NOC here. We're using NFSv4 by default on CentOS 6.10 with the 2.6.32 kernel. I had some strange results with vers=3 or vers=2: I could "ls /data", but "ls /data/subdir" would hang again. Now it doesn't even mount with vers=3 or vers=2.
Beats me. My first guess would be some kind of networking problem.
Maybe try running wireshark and watching to see if certain calls aren't
getting responses.
--b.
On Tue, Oct 22, 2019 at 05:34:51PM -0700, Chandler wrote:
> Hi all, I'm sure you get this alot, but I couldn't figure out any solution. We have a client/server pair with both 1Gb and 10Gb network interfaces. I can mount the share on the client on the 1Gb interface just fine and interact with it normally. If I unmount and try to mount the share on the 10Gb interface, it will mount but everything after that hangs (like ls or df). The exports entry is the same on the server, i.e.:
>
> #1Gb interface
> /data 10.10.10.0/24(rw,no_root_squash,async)
> #10Gb interface
> /data 128.196.X.X/28(rw,no_root_squash,async)
>
> I turned off iptables for troubleshooting and checked with the NOC here. Using NFSv4 by default and CentOS 6.10 2.6.32 kernel. I had some strange results if i try vers=3 or vers=2, then i can "ls /data" but if I try to "ls /data/subdir" then it hangs again. Now it doesn't even mount if i try with vers=3 or vers=2
>
I usually use tcpdump to do a raw packet capture. Something like:
# tcpdump -s 0 -w out.pcap host <nfs-server>
(<nfs-server> is the hostname of the other machine, client or server)
<ctrl>C <-- when you think you have enough
Then you can read out.pcap into wireshark.
rick
________________________________________
From: [email protected] <[email protected]> on behalf of Chandler <[email protected]>
Sent: Thursday, October 24, 2019 7:40 PM
Cc: [email protected]
Subject: Re: NFS hangs on one interface
Thanks Bruce.
Do you (or anyone) have an idea how to use Wireshark's "tshark" on the command line to capture this data? I tried running it, but it captures far too much traffic. Is there a particular port or set of ports I could tell it to monitor? 2049? Thanks
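On the port question: NFSv4 runs over the single registered NFS port, 2049, so a capture filter on that port plus the peer's address should cut the noise considerably. A minimal sketch, assuming the interface is eth2 and the server is x.2 as elsewhere in this thread; the function names are made up for illustration. Note that an NFSv3 or NFSv2 mount also talks to rpcbind (port 111) and mountd, so a 2049-only filter would miss part of that exchange.

```shell
# Build a capture filter for NFS traffic to and from one peer.
# 2049 is the registered NFS port; $1 is the peer's address.
nfs_filter() {
    printf 'host %s and port 2049' "$1"
}

# Hypothetical wrappers; run as root and stop with Ctrl-C.
# Usage: capture_tcpdump eth2 x.2 out.pcap
capture_tcpdump() {
    # -s 0 keeps whole packets, -w writes a raw capture file
    tcpdump -i "$1" -s 0 -w "$3" "$(nfs_filter "$2")"
}

# Usage: capture_tshark eth2 x.2 out.pcap
capture_tshark() {
    # -f takes the same capture-filter syntax as tcpdump
    tshark -i "$1" -f "$(nfs_filter "$2")" -w "$3"
}
```

Either way the resulting out.pcap can be opened in wireshark, which is easier to follow than the one-line-per-packet text output.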
Does this tcpdump help at all? I ran:
tcpdump -i eth2 -s 0 -w out.pcap host x.4 and x.2
Then I ran "mount -v x.2:/data /data" in another terminal, waited until after the timeout, pressed ^C, and then killed tcpdump. The capture looks like this:
1 0.000000 x.4 -> x.2 TCP 66 739 > nfs [ACK] Seq=1 Ack=1 Win=140 Len=0 TSval=2837501537 TSecr=1577364421
2 0.000034 x.4 -> x.2 NFS 206 [TCP Previous segment not captured] V3 READ Call, FH:0x48bf584a Offset:0 Len:131072
3 0.000127 x.2 -> x.4 TCP 60 nfs > 739 [RST] Seq=1 Win=0 Len=0
4 0.000148 x.2 -> x.4 TCP 60 nfs > 739 [RST] Seq=1 Win=0 Len=0
5 3.000003 x.4 -> x.2 TCP 74 [TCP Port numbers reused] 739 > nfs [SYN] Seq=0 Win=17920 Len=0 MSS=8960 SACK_PERM=1 TSval=2837504537 TSecr=0 WS=128
6 3.000182 x.2 -> x.4 TCP 74 nfs > 739 [SYN, ACK] Seq=0 Ack=1 Win=17896 Len=0 MSS=8960 SACK_PERM=1 TSval=1578327421 TSecr=2837504537 WS=128
7 3.000205 x.4 -> x.2 TCP 66 739 > nfs [ACK] Seq=1 Ack=1 Win=17920 Len=0 TSval=2837504537 TSecr=1578327421
8 3.000228 x.4 -> x.2 NFS 206 V3 READ Call, FH:0x48bf584a Offset:0 Len:131072
9 3.000261 x.2 -> x.4 TCP 66 nfs > 739 [ACK] Seq=1 Ack=141 Win=19072 Len=0 TSval=1578327421 TSecr=2837504537
10 4.100351 x.4 -> x.2 NFS 194 V4 Call PUTROOTFH | GETATTR
11 4.139630 x.2 -> x.4 TCP 66 nfs > 798 [ACK] Seq=4113 Ack=129 Win=157 Len=0 TSval=1578328561 TSecr=2837505637
12 44.294611 x.2 -> x.4 TCP 66 netconfsoaphttp > nfs [ACK] Seq=1 Ack=1 Win=140 Len=0 TSval=1578368716 TSecr=2837365831
13 44.294628 x.4 -> x.2 TCP 66 [TCP ACKed unseen segment] nfs > netconfsoaphttp [ACK] Seq=3969 Ack=2 Win=149 Len=0 TSval=2837545831 TSecr=1578188716
14 44.294634 x.2 -> x.4 TCP 66 [TCP Previous segment not captured] netconfsoaphttp > nfs [FIN, ACK] Seq=2 Ack=1 Win=140 Len=0 TSval=1578368716 TSecr=2837365831
15 44.294688 x.4 -> x.2 TCP 66 [TCP ACKed unseen segment] nfs > netconfsoaphttp [FIN, ACK] Seq=3969 Ack=3 Win=149 Len=0 TSval=2837545831 TSecr=1578368716
16 44.294734 x.2 -> x.4 TCP 60 netconfsoaphttp > nfs [RST] Seq=3 Win=0 Len=0
17 47.294699 x.2 -> x.4 TCP 74 [TCP Port numbers reused] netconfsoaphttp > nfs [SYN] Seq=0 Win=17920 Len=0 MSS=8960 SACK_PERM=1 TSval=1578371716 TSecr=0 WS=128
18 47.294723 x.4 -> x.2 TCP 74 nfs > netconfsoaphttp [SYN, ACK] Seq=0 Ack=1 Win=17896 Len=0 MSS=8960 SACK_PERM=1 TSval=2837548831 TSecr=1578371716 WS=128
19 47.294770 x.2 -> x.4 TCP 66 netconfsoaphttp > nfs [ACK] Seq=1 Ack=1 Win=17920 Len=0 TSval=1578371716 TSecr=2837548831
20 47.294794 x.2 -> x.4 NFS 270 V4 Call READDIR FH:0xcb5e6e28
21 47.294802 x.4 -> x.2 TCP 66 nfs > netconfsoaphttp [ACK] Seq=1 Ack=205 Win=19072 Len=0 TSval=2837548831 TSecr=1578371716
22 47.294975 x.4 -> x.2 NFS 4034 V4 Reply (Call In 20) READDIR
23 47.495918 x.4 -> x.2 NFS 4034 [RPC duplicate of #22][TCP Retransmission] V4 Reply (Call In 20) READDIR
24 47.897918 x.4 -> x.2 NFS 4034 [RPC duplicate of #22][TCP Retransmission] V4 Reply (Call In 20) READDIR
25 48.701955 x.4 -> x.2 NFS 4034 [RPC duplicate of #22][TCP Retransmission] V4 Reply (Call In 20) READDIR
26 50.309927 x.4 -> x.2 NFS 4034 [RPC duplicate of #22][TCP Retransmission] V4 Reply (Call In 20) READDIR
27 53.525953 x.4 -> x.2 NFS 4034 [RPC duplicate of #22][TCP Retransmission] V4 Reply (Call In 20) READDIR
28 59.957895 x.4 -> x.2 NFS 4034 [RPC duplicate of #22][TCP Retransmission] V4 Reply (Call In 20) READDIR
29 62.999962 x.4 -> x.2 TCP 66 [TCP Keep-Alive] 739 > nfs [ACK] Seq=140 Ack=1 Win=17920 Len=0 TSval=2837564537 TSecr=1578327421
30 63.000082 x.2 -> x.4 TCP 66 [TCP Previous segment not captured] nfs > 739 [ACK] Seq=17897 Ack=141 Win=19072 Len=0 TSval=1578387421 TSecr=2837504537
31 64.099972 x.4 -> x.2 TCP 66 798 > nfs [FIN, ACK] Seq=129 Ack=1 Win=140 Len=0 TSval=2837565637 TSecr=1578304804
32 64.100147 x.2 -> x.4 NFS 326 V4 Reply (Call In 10) PUTROOTFH | GETATTR
33 64.100174 x.4 -> x.2 TCP 54 798 > nfs [RST] Seq=130 Win=0 Len=0
34 67.099990 x.4 -> x.2 TCP 74 [TCP Port numbers reused] 798 > nfs [SYN] Seq=0 Win=17920 Len=0 MSS=8960 SACK_PERM=1 TSval=2837568637 TSecr=0 WS=128
35 67.100135 x.2 -> x.4 TCP 74 nfs > 798 [SYN, ACK] Seq=0 Ack=1 Win=17896 Len=0 MSS=8960 SACK_PERM=1 TSval=1578391521 TSecr=2837568637 WS=128
36 67.100158 x.4 -> x.2 TCP 66 798 > nfs [ACK] Seq=1 Ack=1 Win=17920 Len=0 TSval=2837568637 TSecr=1578391521
37 67.100181 x.4 -> x.2 NFS 238 V4 Call READDIR FH:0x0366982c
38 67.100222 x.2 -> x.4 TCP 66 nfs > 798 [ACK] Seq=1 Ack=173 Win=19072 Len=0 TSval=1578391521 TSecr=2837568637
39 67.100241 x.4 -> x.2 NFS 194 V4 Call PUTROOTFH | GETATTR
40 67.100285 x.2 -> x.4 TCP 66 nfs > 798 [ACK] Seq=1 Ack=301 Win=20096 Len=0 TSval=1578391521 TSecr=2837568637
41 67.100332 x.2 -> x.4 NFS 326 V4 Reply (Call In 39) PUTROOTFH | GETATTR
42 67.100339 x.4 -> x.2 TCP 66 798 > nfs [ACK] Seq=301 Ack=261 Win=19072 Len=0 TSval=2837568637 TSecr=1578391521
43 67.111864 x.4 -> x.2 NFS 198 V4 Call GETATTR FH:0x62d40c52
44 67.111991 x.2 -> x.4 NFS 158 [TCP Previous segment not captured] V4 Reply (Call In 43) GETATTR
45 67.112010 x.4 -> x.2 TCP 78 [TCP Dup ACK 43#1] 798 > nfs [ACK] Seq=433 Ack=261 Win=19072 Len=0 TSval=2837568649 TSecr=1578391521 SLE=4373 SRE=4465
46 72.821950 x.4 -> x.2 NFS 4034 [RPC duplicate of #22][TCP Retransmission] V4 Reply (Call In 20) READDIR
47 98.549965 x.4 -> x.2 NFS 4034 [RPC duplicate of #22][TCP Retransmission] V4 Reply (Call In 20) READDIR
48 107.294623 x.2 -> x.4 TCP 66 [TCP Keep-Alive] netconfsoaphttp > nfs [ACK] Seq=204 Ack=1 Win=17920 Len=0 TSval=1578431716 TSecr=2837548831
49 107.294642 x.4 -> x.2 TCP 66 [TCP Keep-Alive ACK] nfs > netconfsoaphttp [ACK] Seq=3969 Ack=205 Win=19072 Len=0 TSval=2837608831 TSecr=1578371716
50 122.999956 x.4 -> x.2 TCP 66 [TCP Keep-Alive] 739 > nfs [ACK] Seq=140 Ack=1 Win=17920 Len=0 TSval=2837624537 TSecr=1578327421
51 123.000088 x.2 -> x.4 TCP 66 [TCP Keep-Alive ACK] nfs > 739 [ACK] Seq=17897 Ack=141 Win=19072 Len=0 TSval=1578447421 TSecr=2837504537
52 123.354599 Solarfla_y -> Solarfla_z ARP 60 Who has x.4? Tell x.2
53 123.354613 Solarfla_z -> Solarfla_y ARP 42 x.4 is at 00:0f:53:z
54 127.110922 x.4 -> x.2 TCP 78 798 > nfs [FIN, ACK] Seq=433 Ack=261 Win=19072 Len=0 TSval=2837628648 TSecr=1578391521 SLE=4373 SRE=4465
55 127.111041 x.2 -> x.4 TCP 66 nfs > 798 [FIN, ACK] Seq=4465 Ack=434 Win=21120 Len=0 TSval=1578451532 TSecr=2837628648
56 127.111068 x.4 -> x.2 TCP 54 798 > nfs [RST] Seq=434 Win=0 Len=0
57 130.110999 x.4 -> x.2 TCP 74 [TCP Port numbers reused] 798 > nfs [SYN] Seq=0 Win=17920 Len=0 MSS=8960 SACK_PERM=1 TSval=2837631648 TSecr=0 WS=128
58 130.111106 x.2 -> x.4 TCP 74 nfs > 798 [SYN, ACK] Seq=0 Ack=1 Win=17896 Len=0 MSS=8960 SACK_PERM=1 TSval=1578454532 TSecr=2837631648 WS=128
59 130.111133 x.4 -> x.2 TCP 66 798 > nfs [ACK] Seq=1 Ack=1 Win=17920 Len=0 TSval=2837631648 TSecr=1578454532
60 130.111146 x.4 -> x.2 NFS 238 V4 Call READDIR FH:0x0366982c
61 130.111162 x.4 -> x.2 NFS 198 V4 Call GETATTR FH:0x62d40c52
62 130.111198 x.2 -> x.4 TCP 66 nfs > 798 [ACK] Seq=1 Ack=173 Win=19072 Len=0 TSval=1578454532 TSecr=2837631648
63 130.111208 x.2 -> x.4 TCP 66 nfs > 798 [ACK] Seq=1 Ack=305 Win=20096 Len=0 TSval=1578454532 TSecr=2837631648
64 130.111290 x.2 -> x.4 NFS 158 V4 Reply (Call In 61) GETATTR
65 130.111296 x.4 -> x.2 TCP 66 798 > nfs [ACK] Seq=305 Ack=93 Win=17920 Len=0 TSval=2837631648 TSecr=1578454532
66 130.111367 x.4 -> x.2 NFS 202 V4 Call GETATTR FH:0x62d40c52
67 130.111430 x.2 -> x.4 NFS 178 V4 Reply (Call In 66) GETATTR
68 130.111463 x.4 -> x.2 NFS 198 V4 Call GETATTR FH:0x62d40c52
69 130.111513 x.2 -> x.4 NFS 158 [TCP Previous segment not captured] V4 Reply (Call In 68) GETATTR
70 130.111521 x.4 -> x.2 TCP 78 [TCP Dup ACK 68#1] 798 > nfs [ACK] Seq=573 Ack=205 Win=17920 Len=0 TSval=2837631648 TSecr=1578454532 SLE=4317 SRE=4409
This text-mode trace is too hard to read: it mixes multiple NFS
streams and hides the internals of the packets. A full .cap file
would be much better.
Some things that stick out. If you are doing a v4.0 mount, it
typically would start with a SETCLIENTID. Yours starts with a
PUTROOTFH which means you already have a 4.0 mount going to this
server. "cat /proc/fs/nfsfs/servers" would show you the servers the
client currently has mounts to. If you were not expecting an existing
4.0 mount (i.e., your "mount" command doesn't show that server mounted),
then things have gone wrong already and you have a stuck mount which
might be interfering with further mounts.
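One way to compare what the kernel thinks is mounted with what "mount" shows is sketched below. It assumes the usual /proc/mounts layout (device, mount point, filesystem type, ...) and that NFS mounts appear there with type nfs or nfs4; the function name is hypothetical.

```shell
# List NFS mounts from a /proc/mounts-style file.
# Fields are: device, mount point, fstype, options, dump, pass.
# $1 is optional and defaults to /proc/mounts.
nfs_mounts() {
    awk '$3 == "nfs" || $3 == "nfs4" { print $1, $2, $3 }' "${1:-/proc/mounts}"
}

# Usage:
#   nfs_mounts                    # what the kernel has mounted
#   cat /proc/fs/nfsfs/servers    # servers the NFS client is tracking
```

If a server shows up in the nfsfs list while nfs_mounts prints nothing for it, that would point at the kind of stuck mount described above.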
Do you see the problem after a fresh boot? Do you have the ability
(or luxury) to reboot the client machine?
Your problem description is confusing. Your last network trace shows
a failing v4.0 mount, but your initial description talks about
mounting with "vers=3" or "vers=2". So is the problem with a specific
NFS version, or with mounting over the 10Gb interface with any NFS
version?
You can also turn on rpcdebug messages (if your client machine isn't
getting a lot of NFS traffic) but given your trace I see multiple
streams so you'll have to dig thru lots of output to follow your own
NFS operations.
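To keep the amount of debug output manageable, the flags can be switched on only around the one command being investigated. A rough sketch, assuming rpcdebug from nfs-utils and timeout from coreutils are installed; the helper name and the time limit argument are made up for illustration. The time limit is there because the command under test may hang, and the flags should still be cleared afterwards.

```shell
# Hypothetical helper: enable NFS client debugging while one command
# runs, then clear it again.
# Usage (as root): with_nfs_debug 30 ls -l /data/users
with_nfs_debug() {
    limit=$1; shift
    rpcdebug -m nfs -s all            # set all NFS client debug flags
    timeout -s KILL "$limit" "$@"     # give up after $limit seconds
    rc=$?
    rpcdebug -m nfs -c all            # clear the flags again
    return $rc
}
```

The messages end up in the kernel log (dmesg or /var/log/messages). On the server side the module name would be nfsd rather than nfs.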
On Mon, Nov 4, 2019 at 7:29 PM Chandler <[email protected]> wrote:
>
> Any ideas what's going on here?
> Thanks
Hi Chandler,
Given what you say, it sounds to me more like a generic networking
issue between this particular client and the server. The debug
messages show that the client can't reach the server:
Nov 8 17:58:21 NFSclient kernel: nfs: server x.2 not responding, still trying
I'd recommend making sure that the network works properly between
those interfaces, perhaps by running iperf for a few minutes to check
that you see the expected, consistent performance between the two
nodes. Another thing to check is whether you have duplicate IPs
somewhere on the network; those can show up as weird hangs.
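Both checks could look roughly like the sketch below. It assumes iperf (version 2 syntax) and the iputils arping are installed, and reuses the interface name eth2 and the redacted addresses x.2 (server) and x.4 (client) from earlier in the thread; the function names are hypothetical.

```shell
# On the server, start a listener first:   iperf -s

# Throughput from the client to the server for $2 seconds
# (default 120), with a progress line every 10 seconds.
# Usage: throughput_test x.2
throughput_test() {
    iperf -c "$1" -t "${2:-120}" -i 10
}

# Duplicate address detection: probe for address $2 on interface $1.
# Run on the host that owns the address; any reply means some other
# machine is answering for the same IP.
# Usage (as root, on the client): dup_ip_check eth2 x.4
dup_ip_check() {
    arping -D -I "$1" -c 3 "$2"
}
```

A throughput figure that collapses or stalls part-way through the run, rather than a steady rate, would be the thing to look for.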
On Fri, Nov 8, 2019 at 8:22 PM Chandler <[email protected]> wrote:
>
> Hi Olga, thanks so much for your help.
>
> I rebooted and am still having weird issues. If I mount over the local network (10.x address), there are no issues. As soon as I try to mount over the 10G network, weird things happen. For example, I can perform the mount just fine and do "ll /mount", but as soon as I try another directory like "ll /mount/users", it hangs. Also, this only happens between these two machines with 10G interfaces. The server with the 10G interface has several other 1G clients that are outside the local 10.x network and connect to it on the 10G interface, and those clients all work fine, so it seems to be an issue specific to this client on the 10G interface.
>
> In my earlier post, I did try troubleshooting with vers=3 and vers=2 just to see if that was the issue, but since then have been using the defaults (so vers=4).
>
> I turned on rpcdebug on both client and server with "rpcdebug -m nfs(d) all", but it seemed to lock up the server and I had to reboot it, so I will keep that off for now. I attached a log of the debug messages from the client showing what commands I executed (snoopy) and the resulting kernel debug entries; hope this helps. The hang happens at the end, when I ls -l the users directory. Let me know if there's anything else I can provide.
>
Yes, I don't understand what's going on with the network. I can ssh to the server from the client over the 10G interfaces, log in, and get a prompt. I can even run some commands, but as soon as I try "top" the session freezes. top works just fine if I ssh from my workstation to the server over the 10G interface, and it works fine if I ssh from the client to the server over the 1G interface. Maybe I should post on LinuxQuestions or something?
It seems the problem was a mismatched MTU setting on the switch. Now that the port on the switch is set to 9000, everything is working. Thanks for all your help.
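For anyone who finds this thread later: the SYN packets in the trace above advertise MSS=8960, which is what a host with a 9000-byte MTU announces (9000 minus 40 bytes of IPv4 and TCP headers), and the replies that kept being retransmitted were the 4034-byte READDIR ones while the small GETATTR replies got through. That pattern is consistent with a path that drops frames larger than the standard 1500 bytes. A quick way to test for it is a ping with "don't fragment" set, sketched below; the function names are made up, and x.2 is the redacted server address used in this thread.

```shell
# Largest ICMP echo payload that fits in one IPv4 packet of a given
# MTU: the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

# Send three full-size packets with the "don't fragment" bit set.
# If something on the path cannot carry that MTU, the pings are
# lost or rejected instead of answered.
# Usage: probe_mtu x.2 9000
probe_mtu() {
    ping -M do -c 3 -s "$(icmp_payload "$2")" "$1"
}
```

If "probe_mtu x.2 1500" succeeds while "probe_mtu x.2 9000" does not, the hosts and the switch disagree about jumbo frames. "ip link show eth2" shows the MTU configured on the host side.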