Hello,
I am experiencing random NFS client lockups after 1-2 days. The kernel
reports
| nfs: server XXXXXXXXX not responding, timed out
Processes get stuck in D state and only a reboot helps.
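The stuck tasks can be enumerated with a quick sketch like the following
(just a convenience that scans /proc; any process lister showing the
state column works as well):

import os

# List tasks in uninterruptible sleep ("D" state) by scanning /proc.
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/stat") as f:
            # /proc/<pid>/stat is "pid (comm) state ..."; comm may contain
            # spaces, so split after the closing parenthesis.
            state = f.read().rsplit(")", 1)[1].split()[0]
        if state == "D":
            print(pid, open(f"/proc/{pid}/comm").read().strip())
    except OSError:
        pass  # task exited while we were scanning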
The full log is available at https://pastebin.pl/view/7d0b345b
I can see one oddity there: shortly before the timeouts, the log shows
at 05:07:28:
| worker connecting xprt 0000000022aecad1 via tcp to XXXX:2001:1022:: (port 2049)
| 0000000022aecad1 connect status 0 connected 0 sock state 8
All other connects go through EINPROGRESS first:
| 0000000022aecad1 connect status 115 connected 0 sock state 2
| ...
| state 8 conn 1 dead 0 zapped 1 sk_shutdown 1
After the 'status 0', the rpcdebug log shows (around 05:07:43):
| --> nfs4_alloc_slot used_slots=03ff highest_used=9 max_slots=30
| <-- nfs4_alloc_slot used_slots=07ff highest_used=10 slotid=10
| ...
| <-- nfs4_alloc_slot used_slots=fffffff highest_used=27 slotid=27
| --> nfs4_alloc_slot used_slots=fffffff highest_used=27 max_slots=30
| ...
| --> nfs4_alloc_slot used_slots=3fffffff highest_used=29 max_slots=30
| <-- nfs4_alloc_slot used_slots=3fffffff highest_used=29 slotid=4294967295
| nfs41_sequence_process: Error 1 free the slot
and the NFS server then times out.
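Decoding these values (my own arithmetic, on the assumption that slotid
is logged as an unsigned 32-bit number):

# Decode the used_slots bitmaps from the nfs4_alloc_slot trace above.
for mask in (0x03ff, 0x07ff, 0x0fffffff, 0x3fffffff):
    print(f"used_slots={mask:x}: {bin(mask).count('1')} slots in use")
# used_slots=3fffffff means all 30 of max_slots=30 are busy, and the
# returned slotid 4294967295 is 2**32 - 1, i.e. (u32)-1: no slot could
# be allocated.
assert 4294967295 == 2**32 - 1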
At nearly the same time, the server reports
| Mar 16 05:02:40 kernel: rpc-srv/tcp: nfsd: got error -32 when sending 112 bytes - shutting down socket
Similar messages (with other sizes and sometimes error -104) appear
frequently without a related client lockup.
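For reference, the numeric errors decode as follows (a quick check
against Python's errno tables):

import errno, os

# Translate the error numbers seen above and in the connect traces.
for code in (32, 104, 115):
    print(code, errno.errorcode[code], "-", os.strerror(code))
# 32  EPIPE       - Broken pipe (the nfsd "got error -32" message)
# 104 ECONNRESET  - Connection reset by peer
# 115 EINPROGRESS - Operation now in progress ("connect status 115")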
How can I debug this further or solve it?
It happens (at least) with:
- a Fedora 35 client with kernel 5.16.7,
kernel-5.17.0-0.rc7.20220310git3bf7edc84a9e.119.fc37.x86_64 and some
others in between
- a Rocky Linux 8.5 server with kernel-4.18.0-348.12.2
and kernel-4.18.0-348.2.1
The problem started after a power outage where the whole infrastructure
rebooted. I ran the setup with kernel 5.16.7 on the client and
4.18.0-348.2.1 on the server without problems before the outage.
The issue affects a /home directory mounted with
| XXXX:/home /home nfs4 rw,seclabel,nosuid,nodev,relatime,vers=4.2,rsize=262144,wsize=262144,namlen=255,soft,posix,proto=tcp6,timeo=600,retrans=2,sec=krb5p,clientaddr=XXXX:2001:...,local_lock=none,addr=XXXX:2001:1022:: 0 0
It also happens without the "soft" option.
Applications like Firefox or Chromium are running which hold locks
and access the filesystem frequently.
The logfile was created with the following rpcdebug flags enabled:
| nfs proc xdr root callback client mount pnfs pnfs_ld state
| rpc xprt call debug nfs auth bind sched trans svcsock svcdsp misc cache
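(If needed for reproduction: these flags can be set with the rpcdebug
tool from nfs-utils; a small wrapper sketch, assuming rpcdebug is in
PATH and run as root:)

import subprocess

# Enable the same SUNRPC/NFS debug flags as used for the log above.
nfs_flags = "proc xdr root callback client mount pnfs pnfs_ld state".split()
rpc_flags = ("xprt call debug nfs auth bind sched trans "
             "svcsock svcdsp misc cache").split()

subprocess.run(["rpcdebug", "-m", "nfs", "-s", *nfs_flags], check=True)
subprocess.run(["rpcdebug", "-m", "rpc", "-s", *rpc_flags], check=True)
# Clear again with "rpcdebug -m nfs -c" and "rpcdebug -m rpc -c".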
Enrico
Trond Myklebust <[email protected]> writes:
>> Mar 16 05:02:40.051969: RPC: state 8 conn 1 dead 0 zapped 1 sk_shutdown 1
>> Mar 16 05:02:40.052067: RPC: xs_close xprt 0000000022aecad1
>> Mar 16 05:02:40.052189: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3
>> Mar 16 05:02:40.052243: RPC: xs_error_report client 0000000022aecad1, error=32...
>> Mar 16 05:02:40.052367: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3
>> Mar 16 05:02:40.052503: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3
>> Mar 16 05:02:40.053201: RPC: xs_connect scheduled xprt 0000000022aecad1
>> Mar 16 05:02:40.055886: RPC: xs_bind 0000:0000:0000:0000:0000:0000:0000:0000:875: ok (0)
>> __A__ 05:02:40.055947: RPC: worker connecting xprt 0000000022aecad1 via tcp to XXXXX:2001:1022:: (port 2049)
>> Mar 16 05:02:40.055995: RPC: 0000000022aecad1 connect status 115 connected 0 sock state 2
>
> Socket is in TCP state 2 == SYN_SENT
> So the client requested to connect to the server
Right; the server had closed the connection (state 8, CLOSE_WAIT), and
the client cleaned up correctly and reconnected.
>> Mar 16 05:07:28.326605: RPC: state 8 conn 1 dead 0 zapped 1 sk_shutdown 1
>
> Socket is now in TCP state 8 == TCP_CLOSE_WAIT...
>
> That means the server sent a FIN to the client to request that the
> connection be closed.
Yes; the same situation as above.
>> Mar 16 05:07:28.326679: RPC: xs_connect scheduled xprt 0000000022aecad1
>> __B__ 05:07:28.326978: RPC: worker connecting xprt 0000000022aecad1 via tcp to XXXXX:2001:1022:: (port 2049)
>> Mar 16 05:07:28.327050: RPC: 0000000022aecad1 connect status 0 connected 0 sock state 8
>> __C__ 05:07:28.327113: RPC: xs_close xprt 0000000022aecad1
>
> Client closes the socket, which is still in TCP_CLOSE_WAIT
The 'xs_close' is very likely a reaction to the state change reported
above and should happen before 'xs_connect'.
> Basically, what the above means is that your server is initiating the
> close of the connection, not the client.
Yes, but the client should reconnect (and does so in most cases).
Sometimes there seems to be a race which prevents the reconnect and
leaves the client in a broken state.
Enrico
On Wed, 2022-03-16 at 21:55 +0100, Enrico Scholz wrote:
>
> Enrico Scholz <[email protected]> writes:
>
> > I am experiencing random NFS client lockups after 1-2 days.
> > ...
> > The full log is available at https://pastebin.pl/view/7d0b345b
>
> Filtering only RPC messages[1] shows
>
> -----
> Mar 16 04:48:58.345357: RPC: worker connecting xprt 0000000022aecad1 via tcp to XXXXX:2001:1022:: (port 2049)
> Mar 16 04:48:58.345487: RPC: 0000000022aecad1 connect status 115 connected 0 sock state 2
> Mar 16 04:48:58.345705: RPC: state 1 conn 0 dead 0 zapped 1 sk_shutdown 0
> Mar 16 05:02:40.037783: RPC: state 8 conn 1 dead 0 zapped 1 sk_shutdown 1
> Mar 16 05:02:40.037921: RPC: xs_close xprt 0000000022aecad1
> Mar 16 05:02:40.038076: RPC: state 9 conn 0 dead 0 zapped 1 sk_shutdown 3
> Mar 16 05:02:40.038211: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3
> Mar 16 05:02:40.038293: RPC: xs_error_report client 0000000022aecad1, error=104...
> Mar 16 05:02:40.038449: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3
> Mar 16 05:02:40.039134: RPC: xs_connect scheduled xprt 0000000022aecad1
> Mar 16 05:02:40.042991: RPC: xs_bind 0000:0000:0000:0000:0000:0000:0000:0000:883: ok (0)
> Mar 16 05:02:40.043072: RPC: worker connecting xprt 0000000022aecad1 via tcp to XXXXX:2001:1022:: (port 2049)
> Mar 16 05:02:40.043159: RPC: 0000000022aecad1 connect status 115 connected 0 sock state 2
> Mar 16 05:02:40.043308: RPC: state 1 conn 0 dead 0 zapped 1 sk_shutdown 0
> Mar 16 05:02:40.045883: RPC: state 8 conn 1 dead 0 zapped 1 sk_shutdown 1
> Mar 16 05:02:40.046163: RPC: xs_close xprt 0000000022aecad1
> Mar 16 05:02:40.046410: RPC: state 9 conn 0 dead 0 zapped 1 sk_shutdown 3
> Mar 16 05:02:40.046625: RPC: state 9 conn 0 dead 0 zapped 1 sk_shutdown 3
> Mar 16 05:02:40.047428: RPC: xs_connect scheduled xprt 0000000022aecad1
> Mar 16 05:02:40.050387: RPC: xs_bind 0000:0000:0000:0000:0000:0000:0000:0000:822: ok (0)
> Mar 16 05:02:40.050614: RPC: worker connecting xprt 0000000022aecad1 via tcp to XXXXX:2001:1022:: (port 2049)
> Mar 16 05:02:40.050662: RPC: 0000000022aecad1 connect status 115 connected 0 sock state 2
> Mar 16 05:02:40.050788: RPC: state 1 conn 0 dead 0 zapped 1 sk_shutdown 0
> Mar 16 05:02:40.051969: RPC: state 8 conn 1 dead 0 zapped 1 sk_shutdown 1
> Mar 16 05:02:40.052067: RPC: xs_close xprt 0000000022aecad1
> Mar 16 05:02:40.052189: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3
> Mar 16 05:02:40.052243: RPC: xs_error_report client 0000000022aecad1, error=32...
> Mar 16 05:02:40.052367: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3
> Mar 16 05:02:40.052503: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3
> Mar 16 05:02:40.053201: RPC: xs_connect scheduled xprt 0000000022aecad1
> Mar 16 05:02:40.055886: RPC: xs_bind 0000:0000:0000:0000:0000:0000:0000:0000:875: ok (0)
> __A__ 05:02:40.055947: RPC: worker connecting xprt 0000000022aecad1 via tcp to XXXXX:2001:1022:: (port 2049)
> Mar 16 05:02:40.055995: RPC: 0000000022aecad1 connect status 115 connected 0 sock state 2
Socket is in TCP state 2 == SYN_SENT
So the client requested to connect to the server
> Mar 16 05:02:40.056125: RPC: state 1 conn 0 dead 0 zapped 1 sk_shutdown 0
Socket is in TCP state 1 == TCP_ESTABLISHED
The connection to the server was accepted and is now established.
> Mar 16 05:07:28.326605: RPC: state 8 conn 1 dead 0 zapped 1 sk_shutdown 1
Socket is now in TCP state 8 == TCP_CLOSE_WAIT...
That means the server sent a FIN to the client to request that the
connection be closed.
(see
https://users.cs.northwestern.edu/~agupta/cs340/project2/TCPIP_State_Transition_Diagram.pdf
for more information on the TCP state diagram)
> Mar 16 05:07:28.326679: RPC: xs_connect scheduled xprt 0000000022aecad1
> __B__ 05:07:28.326978: RPC: worker connecting xprt 0000000022aecad1 via tcp to XXXXX:2001:1022:: (port 2049)
> Mar 16 05:07:28.327050: RPC: 0000000022aecad1 connect status 0 connected 0 sock state 8
> __C__ 05:07:28.327113: RPC: xs_close xprt 0000000022aecad1
Client closes the socket, which is still in TCP_CLOSE_WAIT
> Mar 16 05:07:28.327229: RPC: state 9 conn 0 dead 0 zapped 1 sk_shutdown 3
...and then transitions to TCP state 9 == TCP_LAST_ACK
> Mar 16 05:07:28.327446: RPC: state 9 conn 0 dead 0 zapped 1 sk_shutdown 3
> -----
>
For more information on TCP states, see
https://users.cs.northwestern.edu/~agupta/cs340/project2/TCPIP_State_Transition_Diagram.pdf
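For quick reference, the numeric 'state N' values in the RPC lines map
as follows (a lookup sketch; numbering as in the kernel's
include/net/tcp_states.h):

# Linux TCP socket states as printed in the RPC "state N" lines above.
TCP_STATES = {
    1: "TCP_ESTABLISHED", 2: "TCP_SYN_SENT",   3: "TCP_SYN_RECV",
    4: "TCP_FIN_WAIT1",   5: "TCP_FIN_WAIT2",  6: "TCP_TIME_WAIT",
    7: "TCP_CLOSE",       8: "TCP_CLOSE_WAIT", 9: "TCP_LAST_ACK",
    10: "TCP_LISTEN",     11: "TCP_CLOSING",
}
for n in (1, 2, 7, 8, 9):  # the states that occur in this log
    print(n, TCP_STATES[n])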
Basically, what the above means is that your server is initiating the
close of the connection, not the client.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
Enrico Scholz <[email protected]> writes:
> I am experiencing random NFS client lockups after 1-2 days.
> ...
> The full log is available at https://pastebin.pl/view/7d0b345b
Filtering only RPC messages[1] shows
-----
Mar 16 04:48:58.345357: RPC: worker connecting xprt 0000000022aecad1 via tcp to XXXXX:2001:1022:: (port 2049)
Mar 16 04:48:58.345487: RPC: 0000000022aecad1 connect status 115 connected 0 sock state 2
Mar 16 04:48:58.345705: RPC: state 1 conn 0 dead 0 zapped 1 sk_shutdown 0
Mar 16 05:02:40.037783: RPC: state 8 conn 1 dead 0 zapped 1 sk_shutdown 1
Mar 16 05:02:40.037921: RPC: xs_close xprt 0000000022aecad1
Mar 16 05:02:40.038076: RPC: state 9 conn 0 dead 0 zapped 1 sk_shutdown 3
Mar 16 05:02:40.038211: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3
Mar 16 05:02:40.038293: RPC: xs_error_report client 0000000022aecad1, error=104...
Mar 16 05:02:40.038449: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3
Mar 16 05:02:40.039134: RPC: xs_connect scheduled xprt 0000000022aecad1
Mar 16 05:02:40.042991: RPC: xs_bind 0000:0000:0000:0000:0000:0000:0000:0000:883: ok (0)
Mar 16 05:02:40.043072: RPC: worker connecting xprt 0000000022aecad1 via tcp to XXXXX:2001:1022:: (port 2049)
Mar 16 05:02:40.043159: RPC: 0000000022aecad1 connect status 115 connected 0 sock state 2
Mar 16 05:02:40.043308: RPC: state 1 conn 0 dead 0 zapped 1 sk_shutdown 0
Mar 16 05:02:40.045883: RPC: state 8 conn 1 dead 0 zapped 1 sk_shutdown 1
Mar 16 05:02:40.046163: RPC: xs_close xprt 0000000022aecad1
Mar 16 05:02:40.046410: RPC: state 9 conn 0 dead 0 zapped 1 sk_shutdown 3
Mar 16 05:02:40.046625: RPC: state 9 conn 0 dead 0 zapped 1 sk_shutdown 3
Mar 16 05:02:40.047428: RPC: xs_connect scheduled xprt 0000000022aecad1
Mar 16 05:02:40.050387: RPC: xs_bind 0000:0000:0000:0000:0000:0000:0000:0000:822: ok (0)
Mar 16 05:02:40.050614: RPC: worker connecting xprt 0000000022aecad1 via tcp to XXXXX:2001:1022:: (port 2049)
Mar 16 05:02:40.050662: RPC: 0000000022aecad1 connect status 115 connected 0 sock state 2
Mar 16 05:02:40.050788: RPC: state 1 conn 0 dead 0 zapped 1 sk_shutdown 0
Mar 16 05:02:40.051969: RPC: state 8 conn 1 dead 0 zapped 1 sk_shutdown 1
Mar 16 05:02:40.052067: RPC: xs_close xprt 0000000022aecad1
Mar 16 05:02:40.052189: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3
Mar 16 05:02:40.052243: RPC: xs_error_report client 0000000022aecad1, error=32...
Mar 16 05:02:40.052367: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3
Mar 16 05:02:40.052503: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3
Mar 16 05:02:40.053201: RPC: xs_connect scheduled xprt 0000000022aecad1
Mar 16 05:02:40.055886: RPC: xs_bind 0000:0000:0000:0000:0000:0000:0000:0000:875: ok (0)
__A__ 05:02:40.055947: RPC: worker connecting xprt 0000000022aecad1 via tcp to XXXXX:2001:1022:: (port 2049)
Mar 16 05:02:40.055995: RPC: 0000000022aecad1 connect status 115 connected 0 sock state 2
Mar 16 05:02:40.056125: RPC: state 1 conn 0 dead 0 zapped 1 sk_shutdown 0
Mar 16 05:07:28.326605: RPC: state 8 conn 1 dead 0 zapped 1 sk_shutdown 1
Mar 16 05:07:28.326679: RPC: xs_connect scheduled xprt 0000000022aecad1
__B__ 05:07:28.326978: RPC: worker connecting xprt 0000000022aecad1 via tcp to XXXXX:2001:1022:: (port 2049)
Mar 16 05:07:28.327050: RPC: 0000000022aecad1 connect status 0 connected 0 sock state 8
__C__ 05:07:28.327113: RPC: xs_close xprt 0000000022aecad1
Mar 16 05:07:28.327229: RPC: state 9 conn 0 dead 0 zapped 1 sk_shutdown 3
Mar 16 05:07:28.327446: RPC: state 9 conn 0 dead 0 zapped 1 sk_shutdown 3
-----
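(The filter [1] essentially keeps the RPC-tagged lines; as the exact
command is not shown here, an approximate sketch:)

import sys

# Keep only SUNRPC debug lines from a kernel log dump,
# e.g.: journalctl -k --output=short-precise | python3 filter_rpc.py
for line in sys.stdin:
    if "RPC:" in line:
        sys.stdout.write(line)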
To me, it seems that the 'xs_close' at position __C__ belongs to the
connection at __A__ but is applied to the newly created connection
__B__.
Because the close had not happened yet, __B__ operated on the old
socket, which is in state 8 (CLOSE_WAIT), so connect returned 0 instead
of -EINPROGRESS. No new connection gets created in this case.
It seems to be a race where 'xs_connect()' is executed before a
delayed 'xs_error_handle()'. Both functions are called in reaction to
a CLOSE_WAIT.
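As an illustration only (a toy model of the suspected ordering, not the
kernel code): the CLOSE_WAIT state change leads to both a delayed
cleanup and a reconnect; if the reconnect worker wins the race,
connect() runs on the old, half-closed socket.

# Toy model of the suspected xs_connect()/xs_error_handle() race.
EINPROGRESS = 115
TCP_CLOSE_WAIT = 8

class Xprt:
    def __init__(self):
        self.sock_state = TCP_CLOSE_WAIT  # old socket: server sent FIN
        self.sock_closed = False

def error_handle(x):       # delayed cleanup, runs at __C__ in the log
    x.sock_closed = True   # old socket torn down; a later connect
    x.sock_state = None    # would have to create a fresh one

def connect(x):            # reconnect worker, runs at __B__ in the log
    if not x.sock_closed:  # old socket still attached:
        return 0           # "connect status 0 ... sock state 8"
    return EINPROGRESS     # fresh socket: handshake actually starts

x = Xprt()
print(connect(x))          # buggy ordering: connect first -> returns 0,
error_handle(x)            # then the close tears the socket down and
                           # no new connection is ever established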
Enrico
On Wed, 2022-03-16 at 23:05 +0100, Enrico Scholz wrote:
>
> Trond Myklebust <[email protected]> writes:
>
> > Basically, what the above means is that your server is initiating
> > the
> > close of the connection, not the client.
>
> Yes, but the client should reconnect (and does so in most cases).
> Sometimes there seems to be a race which prevents the reconnect and
> leaves the client in a broken state.
>
I didn't realise that you were seeing a long-term hang.
Can you please try the attached patch and see if it helps?
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
Trond Myklebust <[email protected]> writes:
>> Yes, but the client should reconnect (and does so in most cases).
>> Sometimes there seems to be a race which prevents the reconnect and
>> leaves the client in a broken state.
>>
>
> I didn't realise that you were seeing a long term hang.
> Can you please try the attached patch and see if it helps?
Thanks; I applied the patch and it has been running without problems so
far. It very likely fixes the connect-close race. I hope this was the
reason for the lockups, but it is too early to confirm, as they usually
take 1-2 days to occur.
Enrico
Enrico Scholz <[email protected]> writes:
>> Can you please try the attached patch and see if it helps?
>
> Thanks; I applied the patch and it has been running without problems so
> far. It very likely fixes the connect-close race. I hope this was the
> reason for the lockups, but it is too early to confirm, as they usually
> take 1-2 days to occur.
The client has now been running for >10 days without lockups or other
regressions.
Thanks again
Enrico