2024-06-13 22:44:34

by Chris Friesen

[permalink] [raw]
Subject: Seeing strange behaviour from RPC/NFSD on 6.6.7 kernel, looking for debug advice

Hi all,

I'm not subscribed to the list so please CC me on replies. I'm seeing
some odd behaviour and I'm not sure what's going on.

The short version is that I can mount NFS exports from this server
(including mounting on the server itself) via TCP but not via UDP.

rpcinfo on the server gives the following:

root@controller-0:/var/home/sysadmin# rpcinfo -s
program version(s) netid(s) service owner
100000 2,3,4 local,udp,tcp,udp6,tcp6 portmapper superuser
100024 1 tcp6,udp6,tcp,udp status 116
100003 4,3 udp6,tcp6,udp,tcp nfs superuser
100227 3 udp6,tcp6,udp,tcp - superuser
100021 4,3,1 tcp6,udp6,tcp,udp nlockmgr superuser
100005 3,2,1 tcp6,udp6,tcp,udp mountd superuser

root@controller-0:/var/home/sysadmin# rpcinfo -p
program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100024 1 udp 44993 status
100024 1 tcp 60561 status
100003 3 tcp 2049 nfs
100003 4 tcp 2049 nfs
100227 3 tcp 2049
100003 3 udp 2049 nfs
100227 3 udp 2049
100021 1 udp 36993 nlockmgr
100021 3 udp 36993 nlockmgr
100021 4 udp 36993 nlockmgr
100021 1 tcp 33239 nlockmgr
100021 3 tcp 33239 nlockmgr
100021 4 tcp 33239 nlockmgr
100005 1 udp 43636 mountd
100005 1 tcp 49357 mountd
100005 2 udp 39783 mountd
100005 2 tcp 54527 mountd
100005 3 udp 40970 mountd
100005 3 tcp 41761 mountd


So it *looks* okay to me, but if I probe it I get the expected response
for TCP but not for UDP:

root@controller-0:/var/home/sysadmin# rpcinfo -T tcp localhost 100003 4
program 100003 version 4 ready and waiting
root@controller-0:/var/home/sysadmin# rpcinfo -T tcp localhost 100003 3
program 100003 version 3 ready and waiting
root@controller-0:/var/home/sysadmin# rpcinfo -T udp localhost 100003 3
rpcinfo: RPC: Timed out
program 100003 version 3 is not available


Oddly the last command returns immediately so it's not actually timing
out, the error message is a bit misleading. If I sniff the network
traffic while running that last command, I see this:

root@controller-0:/var/home/sysadmin# tcpdump -vvv -i lo port 2049
tcpdump: listening on lo, link-type EN10MB (Ethernet), snapshot length
262144 bytes
20:20:53.159743 IP (tos 0x0, ttl 64, id 58251, offset 0, flags [DF],
proto UDP (17), length 68)
localhost.887 > localhost.nfs: NFS request xid 1717614110 40 null
20:20:53.160016 IP (tos 0x0, ttl 64, id 58252, offset 0, flags [DF],
proto UDP (17), length 28)
localhost.nfs > localhost.887: [bad udp cksum 0xfe1b -> 0xf663!]
UDP, length 0


Ignoring the checksums, I get this:

root@controller-0:/var/home/sysadmin# tcpdump -vvv -K -i lo port 2049
tcpdump: listening on lo, link-type EN10MB (Ethernet), snapshot length
262144 bytes
20:21:49.642588 IP (tos 0x0, ttl 64, id 22026, offset 0, flags [DF],
proto UDP (17), length 68)
localhost.934 > localhost.nfs: NFS request xid 1718297876 40 null
20:21:49.642697 IP (tos 0x0, ttl 64, id 22027, offset 0, flags [DF],
proto UDP (17), length 28)
localhost.nfs > localhost.934: UDP, length 0


On a separate system with a 5.10 kernel I get a successful response that
is longer and is properly decoded:

root@controller-0:/var/home/sysadmin# tcpdump -vvv -K -i lo port 2049
tcpdump: listening on lo, link-type EN10MB (Ethernet), snapshot length
262144 bytes
22:22:00.388497 IP (tos 0x0, ttl 64, id 806, offset 0, flags [DF], proto
UDP (17), length 68)
localhost.808 > localhost.nfs: NFS request xid 1718037578 40 null
22:22:00.388525 IP (tos 0x0, ttl 64, id 807, offset 0, flags [DF], proto
UDP (17), length 52)
localhost.nfs > localhost.808: NFS reply xid 1718037578 reply ok 24
null


Anyone have any ideas what might be going on or how to debug? I'm
building a kernel with CONFIG_SUNRPC_DEBUG enabled to see if that gives
anything useful.

Thanks,
Chris


2024-06-14 13:09:22

by Chuck Lever III

[permalink] [raw]
Subject: Re: Seeing strange behaviour from RPC/NFSD on 6.6.7 kernel, looking for debug advice



> On Jun 13, 2024, at 6:44 PM, Chris Friesen <[email protected]> wrote:
>
> Hi all,
>
> I'm not subscribed to the list so please CC me on replies. I'm seeing some odd behaviour and I'm not sure what's going on.
>
> The short version is that I can mount NFS exports from this server (including mounting on the server itself) via TCP but not via UDP.
>
> rpcinfo on the server gives the following:
>
> root@controller-0:/var/home/sysadmin# rpcinfo -s
> program version(s) netid(s) service owner
> 100000 2,3,4 local,udp,tcp,udp6,tcp6 portmapper superuser
> 100024 1 tcp6,udp6,tcp,udp status 116
> 100003 4,3 udp6,tcp6,udp,tcp nfs superuser
> 100227 3 udp6,tcp6,udp,tcp - superuser
> 100021 4,3,1 tcp6,udp6,tcp,udp nlockmgr superuser
> 100005 3,2,1 tcp6,udp6,tcp,udp mountd superuser
>
> root@controller-0:/var/home/sysadmin# rpcinfo -p
> program vers proto port service
> 100000 4 tcp 111 portmapper
> 100000 3 tcp 111 portmapper
> 100000 2 tcp 111 portmapper
> 100000 4 udp 111 portmapper
> 100000 3 udp 111 portmapper
> 100000 2 udp 111 portmapper
> 100024 1 udp 44993 status
> 100024 1 tcp 60561 status
> 100003 3 tcp 2049 nfs
> 100003 4 tcp 2049 nfs
> 100227 3 tcp 2049
> 100003 3 udp 2049 nfs
> 100227 3 udp 2049
> 100021 1 udp 36993 nlockmgr
> 100021 3 udp 36993 nlockmgr
> 100021 4 udp 36993 nlockmgr
> 100021 1 tcp 33239 nlockmgr
> 100021 3 tcp 33239 nlockmgr
> 100021 4 tcp 33239 nlockmgr
> 100005 1 udp 43636 mountd
> 100005 1 tcp 49357 mountd
> 100005 2 udp 39783 mountd
> 100005 2 tcp 54527 mountd
> 100005 3 udp 40970 mountd
> 100005 3 tcp 41761 mountd
>
>
> So it *looks* okay to me, but if I probe it I get the expected response for TCP but not for UDP:
>
> root@controller-0:/var/home/sysadmin# rpcinfo -T tcp localhost 100003 4
> program 100003 version 4 ready and waiting
> root@controller-0:/var/home/sysadmin# rpcinfo -T tcp localhost 100003 3
> program 100003 version 3 ready and waiting
> root@controller-0:/var/home/sysadmin# rpcinfo -T udp localhost 100003 3
> rpcinfo: RPC: Timed out
> program 100003 version 3 is not available
>
>
> Oddly the last command returns immediately so it's not actually timing out, the error message is a bit misleading. If I sniff the network traffic while running that last command, I see this:
>
> root@controller-0:/var/home/sysadmin# tcpdump -vvv -i lo port 2049
> tcpdump: listening on lo, link-type EN10MB (Ethernet), snapshot length 262144 bytes
> 20:20:53.159743 IP (tos 0x0, ttl 64, id 58251, offset 0, flags [DF], proto UDP (17), length 68)
> localhost.887 > localhost.nfs: NFS request xid 1717614110 40 null
> 20:20:53.160016 IP (tos 0x0, ttl 64, id 58252, offset 0, flags [DF], proto UDP (17), length 28)
> localhost.nfs > localhost.887: [bad udp cksum 0xfe1b -> 0xf663!] UDP, length 0
>
>
> Ignoring the checksums, I get this:
>
> root@controller-0:/var/home/sysadmin# tcpdump -vvv -K -i lo port 2049
> tcpdump: listening on lo, link-type EN10MB (Ethernet), snapshot length 262144 bytes
> 20:21:49.642588 IP (tos 0x0, ttl 64, id 22026, offset 0, flags [DF], proto UDP (17), length 68)
> localhost.934 > localhost.nfs: NFS request xid 1718297876 40 null
> 20:21:49.642697 IP (tos 0x0, ttl 64, id 22027, offset 0, flags [DF], proto UDP (17), length 28)
> localhost.nfs > localhost.934: UDP, length 0
>
>
> On a separate system with a 5.10 kernel I get a successful response that is longer and is properly decoded:
>
> root@controller-0:/var/home/sysadmin# tcpdump -vvv -K -i lo port 2049
> tcpdump: listening on lo, link-type EN10MB (Ethernet), snapshot length 262144 bytes
> 22:22:00.388497 IP (tos 0x0, ttl 64, id 806, offset 0, flags [DF], proto UDP (17), length 68)
> localhost.808 > localhost.nfs: NFS request xid 1718037578 40 null
> 22:22:00.388525 IP (tos 0x0, ttl 64, id 807, offset 0, flags [DF], proto UDP (17), length 52)
> localhost.nfs > localhost.808: NFS reply xid 1718037578 reply ok 24 null
>
>
> Anyone have any ideas what might be going on or how to debug? I'm building a kernel with CONFIG_SUNRPC_DEBUG enabled to see if that gives anything useful.

linux-6.6.15 has "SUNRPC: use request size to initialize bio_vec in
svc_udp_sendto()" which might address this issue.


--
Chuck Lever