2023-01-23 16:31:14

by Andrew Klaassen

Subject: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

Hello,

There's a specific NFSv4 mount on a specific machine which we'd like to time out and return an error after a few seconds if the server goes away.

I've confirmed the following on two different kernels, 4.18.0-348.12.2.el8_5.x86_64 and 6.1.7-200.fc37.x86_64.

I've been able to get both autofs and the mount command to cooperate, so that the mount attempt fails after an arbitrary number of seconds. This mount command, for example, will fail after 6 seconds, as expected from the timeo=20,retrans=2,retry=0 options (timeo is in tenths of a second, so each attempt times out after 2 seconds, and retrans=2 means three transmissions in total: 3 x 2 = 6 seconds):

$ time sudo mount -t nfs4 -o rw,relatime,sync,vers=4.2,rsize=131072,wsize=131072,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,proto=tcp,timeo=20,retrans=2,retry=0,sec=sys thor04:/mnt/thorfs04 /mnt/thor04
mount.nfs4: Connection timed out

real 0m6.084s
user 0m0.007s
sys 0m0.015s

However, if the share is already mounted and the server goes away, the timeout is always 2 minutes plus the time I expect based on timeo and retrans. In this case, 2 minutes and 6 seconds:

$ time ls /mnt/thor04
ls: cannot access '/mnt/thor04': Connection timed out

real 2m6.025s
user 0m0.003s
sys 0m0.000s

Watching the outgoing packets in the second case, the pattern is always the same:
- 0.2 seconds between the first two, then doubling each time until the two minute mark is exceeded (so the last NFS packet, which is always the 11th packet, is sent around 1:45 after the first).
- Then some TCP SYN packets (the client reconnecting from a new source port) that start exactly-ish on the two minute mark, 1 second between the first two, then doubling each time. (By this time the NFS command has given up.)

11:10:21.898305 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834889483 ecr 1589769203], length 200: NFS request xid 3614904256 196 getattr fh 0,2/53
11:10:22.105189 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834889690 ecr 1589769203], length 200: NFS request xid 3614904256 196 getattr fh 0,2/53
11:10:22.313290 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834889898 ecr 1589769203], length 200: NFS request xid 3614904256 196 getattr fh 0,2/53
11:10:22.721269 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834890306 ecr 1589769203], length 200: NFS request xid 3614904256 196 getattr fh 0,2/53
11:10:23.569192 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834891154 ecr 1589769203], length 200: NFS request xid 3614904256 196 getattr fh 0,2/53
11:10:25.233212 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834892818 ecr 1589769203], length 200: NFS request xid 3614904256 196 getattr fh 0,2/53
11:10:28.497282 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834896082 ecr 1589769203], length 200: NFS request xid 3614904256 196 getattr fh 0,2/53
11:10:35.025219 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834902610 ecr 1589769203], length 200: NFS request xid 3614904256 196 getattr fh 0,2/53
11:10:48.337201 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834915922 ecr 1589769203], length 200: NFS request xid 3614904256 196 getattr fh 0,2/53
11:11:14.449303 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834942034 ecr 1589769203], length 200: NFS request xid 3614904256 196 getattr fh 0,2/53
11:12:08.721251 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834996306 ecr 1589769203], length 200: NFS request xid 3614904256 196 getattr fh 0,2/53
11:12:22.545394 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S], seq 1375256951, win 64240, options [mss 1460,sackOK,TS val 835010130 ecr 0,nop,wscale 7], length 0
11:12:23.570199 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S], seq 1375256951, win 64240, options [mss 1460,sackOK,TS val 835011155 ecr 0,nop,wscale 7], length 0
11:12:25.617284 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S], seq 1375256951, win 64240, options [mss 1460,sackOK,TS val 835013202 ecr 0,nop,wscale 7], length 0
11:12:29.649219 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S], seq 1375256951, win 64240, options [mss 1460,sackOK,TS val 835017234 ecr 0,nop,wscale 7], length 0
11:12:37.905274 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S], seq 1375256951, win 64240, options [mss 1460,sackOK,TS val 835025490 ecr 0,nop,wscale 7], length 0
11:12:54.289212 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S], seq 1375256951, win 64240, options [mss 1460,sackOK,TS val 835041874 ecr 0,nop,wscale 7], length 0
11:13:26.545304 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S], seq 1375256951, win 64240, options [mss 1460,sackOK,TS val 835074130 ecr 0,nop,wscale 7], length 0

I tried changing tcp_retries2 as suggested in another thread from this list:

# echo 3 > /proc/sys/net/ipv4/tcp_retries2

...but it made no difference on either kernel. The 2 minute timeout also doesn't match what I'd calculate from the default value of tcp_retries2, which should give a much higher timeout.
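
For what it's worth, here's the rough calculation I'm doing. The 200 ms initial RTO, the doubling, and the 120 second cap at TCP_RTO_MAX are assumptions based on the packet timings above, not exact kernel arithmetic:

/* Rough estimate of how long TCP retransmits before giving up, for a
 * given tcp_retries2. Assumes an initial RTO of 200 ms that doubles on
 * each retry and is capped at TCP_RTO_MAX (120 seconds). */
#include <stdio.h>

int main(void)
{
        double rto = 0.2;         /* assumed initial RTO, in seconds */
        double total = 0.0;
        int tcp_retries2 = 15;    /* the kernel default */
        int i;

        for (i = 0; i < tcp_retries2; i++) {
                total += rto;
                rto = rto * 2 > 120.0 ? 120.0 : rto * 2;
        }
        printf("~%.0f seconds with tcp_retries2=%d\n", total, tcp_retries2);
        return 0;
}

With the default tcp_retries2 of 15 that works out to over 13 minutes, nowhere near the 2 minutes I'm actually seeing.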

The only clue I've been able to find is in the retry=n entry in the NFS manpage:

" For TCP the default is 3 minutes, but system TCP connection timeouts will sometimes limit the timeout of each retransmission to around 2 minutes."

What I'm not able to make sense of:
- The retry option says that it applies to mount operations, not read/write operations. However, in this case I'm seeing the 2 minute delay on read/write operations but *not* mount operations.
- A couple of hours of searching didn't lead me to any kernel settings that would result in a 2 minute timeout.

Does anyone have any clues about a) what's happening and b) how to get our desired behaviour of being able to control both mount and read/write timeouts down to a few seconds?

Thanks.

Andrew



2023-01-23 16:35:15

by Chuck Lever

Subject: Re: Trying to reduce NFSv4 timeouts to a few seconds on an established connection



> On Jan 23, 2023, at 11:31 AM, Andrew Klaassen <[email protected]> wrote:
>
> Hello,
>
> There's a specific NFSv4 mount on a specific machine which we'd like to timeout and return an error after a few seconds if the server goes away.
>
> I've confirmed the following on two different kernels, 4.18.0-348.12.2.el8_5.x86_64 and 6.1.7-200.fc37.x86_64.
>
> I've been able to get both autofs and the mount command to cooperate, so that the mount attempt fails after an arbitrary number of seconds. This mount command, for example, will fail after 6 seconds, as expected based on the timeo=20,retrans=2,retry=0 options:
>
> $ time sudo mount -t nfs4 -o rw,relatime,sync,vers=4.2,rsize=131072,wsize=131072,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,proto=tcp,timeo=20,retrans=2,retry=0,sec=sys thor04:/mnt/thorfs04 /mnt/thor04
> mount.nfs4: Connection timed out
>
> real 0m6.084s
> user 0m0.007s
> sys 0m0.015s
>
> However, if the share is already mounted and the server goes away, the timeout is always 2 minutes plus the time I expect based on timeo and retrans. In this case, 2 minutes and 6 seconds:
>
> $ time ls /mnt/thor04
> ls: cannot access '/mnt/thor04': Connection timed out
>
> real 2m6.025s
> user 0m0.003s
> sys 0m0.000s
>
> Watching the outgoing packets in the second case, the pattern is always the same:
> - 0.2 seconds between the first two, then doubling each time until the two minute mark is exceeded (so the last NFS packet, which is always the 11th packet, is sent around 1:45 after the first).
> - Then some generic packets that start exactly-ish on the two minute mark, 1 second between the first two, then doubling each time. (By this time the NFS command has given up.)
>
> [tcpdump output snipped]
>
> I tried changing tcp_retries2 as suggested in another thread from this list:
>
> # echo 3 > /proc/sys/net/ipv4/tcp_retries2
>
> ...but it made no difference on either kernel. The 2 minute timeout also doesn't seem to match with what I'd calculate from the initial value of tcp_retries2, which should give a much higher timeout.
>
> The only clue I've been able to find is in the retry=n entry in the NFS manpage:
>
> " For TCP the default is 3 minutes, but system TCP connection timeouts will sometimes limit the timeout of each retransmission to around 2 minutes."
>
> What I'm not able to make sense of:
> - The retry option says that it applies to mount operations, not read/write operations. However, in this case I'm seeing the 2 minute delay on read/write operations but *not* mount operations.
> - A couple of hours of searching didn't lead me to any kernel settings that would result in a 2 minute timeout.
>
> Does anyone have any clues about a) what's happening and b) how to get our desired behaviour of being able to control both mount and read/write timeouts down to a few seconds?

If the server is already mounted on that client at another mount point,
then the client will share the transport amongst mounts of the same server.

The first mount's options take precedence, and subsequent mounts re-use
that mount's transport and the mount options that control it.


--
Chuck Lever




2023-01-23 16:44:05

by Andrew Klaassen

Subject: RE: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

> From: Chuck Lever III <[email protected]>
> Sent: Monday, January 23, 2023 11:35 AM
>
> > On Jan 23, 2023, at 11:31 AM, Andrew Klaassen
> <[email protected]> wrote:
> >
> > Hello,
> >
> > There's a specific NFSv4 mount on a specific machine which we'd like to
> timeout and return an error after a few seconds if the server goes away.
> >
> > I've confirmed the following on two different kernels, 4.18.0-
> 348.12.2.el8_5.x86_64 and 6.1.7-200.fc37.x86_64.
> >
> > I've been able to get both autofs and the mount command to cooperate,
> so that the mount attempt fails after an arbitrary number of seconds. This
> mount command, for example, will fail after 6 seconds, as expected based on
> the timeo=20,retrans=2,retry=0 options:
> >
> > $ time sudo mount -t nfs4 -o
> > rw,relatime,sync,vers=4.2,rsize=131072,wsize=131072,namlen=255,acregmi
> >
> n=0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,proto=tcp,timeo=20,retr
> > ans=2,retry=0,sec=sys thor04:/mnt/thorfs04 /mnt/thor04
> > mount.nfs4: Connection timed out
> >
> > real 0m6.084s
> > user 0m0.007s
> > sys 0m0.015s
> >
> > However, if the share is already mounted and the server goes away, the
> timeout is always 2 minutes plus the time I expect based on timeo and
> retrans. In this case, 2 minutes and 6 seconds:
> >
> > $ time ls /mnt/thor04
> > ls: cannot access '/mnt/thor04': Connection timed out
> >
> > real 2m6.025s
> > user 0m0.003s
> > sys 0m0.000s
> >
> > Watching the outgoing packets in the second case, the pattern is always
> the same:
> > - 0.2 seconds between the first two, then doubling each time until the two
> minute mark is exceeded (so the last NFS packet, which is always the 11th
> packet, is sent around 1:45 after the first).
> > - Then some generic packets that start exactly-ish on the two minute
> > mark, 1 second between the first two, then doubling each time. (By
> > this time the NFS command has given up.)
> >
> > [tcpdump output snipped]
> >
> > I tried changing tcp_retries2 as suggested in another thread from this list:
> >
> > # echo 3 > /proc/sys/net/ipv4/tcp_retries2
> >
> > ...but it made no difference on either kernel. The 2 minute timeout also
> doesn't seem to match with what I'd calculate from the initial value of
> tcp_retries2, which should give a much higher timeout.
> >
> > The only clue I've been able to find is in the retry=n entry in the NFS
> manpage:
> >
> > " For TCP the default is 3 minutes, but system TCP connection timeouts will
> sometimes limit the timeout of each retransmission to around 2 minutes."
> >
> > What I'm not able to make sense of:
> > - The retry option says that it applies to mount operations, not read/write
> operations. However, in this case I'm seeing the 2 minute delay on
> read/write operations but *not* mount operations.
> > - A couple of hours of searching didn't lead me to any kernel settings that
> would result in a 2 minute timeout.
> >
> > Does anyone have any clues about a) what's happening and b) how to get
> our desired behaviour of being able to control both mount and read/write
> timeouts down to a few seconds?
>
> If the server is already mounted on that client at another mount point, then
> the client will share the transport amongst mounts of the same server.
>
> The first mount's options take precedence, and subsequent mounts re-use
> that mount's transport and the mount options that control it.

That's good to know, Chuck, thanks.

In this case, though, I'm seeing the behaviour with only this single NFS mount on my test client.

Andrew



2023-01-26 15:31:39

by Andrew Klaassen

Subject: RE: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

> From: Andrew Klaassen <[email protected]>
> Sent: Monday, January 23, 2023 11:31 AM
>
> Hello,
>
> There's a specific NFSv4 mount on a specific machine which we'd like to
> timeout and return an error after a few seconds if the server goes away.
>
> I've confirmed the following on two different kernels, 4.18.0-
> 348.12.2.el8_5.x86_64 and 6.1.7-200.fc37.x86_64.
>
> I've been able to get both autofs and the mount command to cooperate, so
> that the mount attempt fails after an arbitrary number of seconds. This
> mount command, for example, will fail after 6 seconds, as expected based on
> the timeo=20,retrans=2,retry=0 options:
>
> $ time sudo mount -t nfs4 -o
> rw,relatime,sync,vers=4.2,rsize=131072,wsize=131072,namlen=255,acregmin
> =0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,proto=tcp,timeo=20,retran
> s=2,retry=0,sec=sys thor04:/mnt/thorfs04 /mnt/thor04
> mount.nfs4: Connection timed out
>
> real 0m6.084s
> user 0m0.007s
> sys 0m0.015s
>
> However, if the share is already mounted and the server goes away, the
> timeout is always 2 minutes plus the time I expect based on timeo and
> retrans. In this case, 2 minutes and 6 seconds:
>
> $ time ls /mnt/thor04
> ls: cannot access '/mnt/thor04': Connection timed out
>
> real 2m6.025s
> user 0m0.003s
> sys 0m0.000s
>
> Watching the outgoing packets in the second case, the pattern is always the
> same:
> - 0.2 seconds between the first two, then doubling each time until the two
> minute mark is exceeded (so the last NFS packet, which is always the 11th
> packet, is sent around 1:45 after the first).
> - Then some generic packets that start exactly-ish on the two minute mark, 1
> second between the first two, then doubling each time. (By this time the
> NFS command has given up.)
>
> [tcpdump output snipped]
>
> I tried changing tcp_retries2 as suggested in another thread from this list:
>
> # echo 3 > /proc/sys/net/ipv4/tcp_retries2
>
> ...but it made no difference on either kernel. The 2 minute timeout also
> doesn't seem to match with what I'd calculate from the initial value of
> tcp_retries2, which should give a much higher timeout.
>
> The only clue I've been able to find is in the retry=n entry in the NFS
> manpage:
>
> " For TCP the default is 3 minutes, but system TCP connection timeouts will
> sometimes limit the timeout of each retransmission to around 2 minutes."
>
> What I'm not able to make sense of:
> - The retry option says that it applies to mount operations, not read/write
> operations. However, in this case I'm seeing the 2 minute delay on
> read/write operations but *not* mount operations.
> - A couple of hours of searching didn't lead me to any kernel settings that
> would result in a 2 minute timeout.
>
> Does anyone have any clues about a) what's happening and b) how to get
> our desired behaviour of being able to control both mount and read/write
> timeouts down to a few seconds?
>
> Thanks.

I thought that changing TCP_RTO_MAX in include/net/tcp.h from 120 to something smaller and recompiling the kernel would change the 2 minute timeout, but it had no effect. I'm going to keep poking through the kernel code to see if there's a knob I can turn to change the 2 minute timeout, so that I can at least understand where it's coming from.

Any hints as to where I should be looking?

Andrew



2023-01-26 22:08:07

by Andrew Klaassen

Subject: RE: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

> From: Andrew Klaassen <[email protected]>
> Sent: Thursday, January 26, 2023 10:32 AM
>
> > From: Andrew Klaassen <[email protected]>
> > Sent: Monday, January 23, 2023 11:31 AM
> >
> > Hello,
> >
> > There's a specific NFSv4 mount on a specific machine which we'd like
> > to timeout and return an error after a few seconds if the server goes away.
> >
> > I've confirmed the following on two different kernels, 4.18.0-
> > 348.12.2.el8_5.x86_64 and 6.1.7-200.fc37.x86_64.
> >
> > I've been able to get both autofs and the mount command to cooperate,
> > so that the mount attempt fails after an arbitrary number of seconds.
> > This mount command, for example, will fail after 6 seconds, as
> > expected based on the timeo=20,retrans=2,retry=0 options:
> >
> > $ time sudo mount -t nfs4 -o
> > rw,relatime,sync,vers=4.2,rsize=131072,wsize=131072,namlen=255,acregmi
> > n
> >
> =0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,proto=tcp,timeo=20,retra
> > n s=2,retry=0,sec=sys thor04:/mnt/thorfs04 /mnt/thor04
> > mount.nfs4: Connection timed out
> >
> > real 0m6.084s
> > user 0m0.007s
> > sys 0m0.015s
> >
> > However, if the share is already mounted and the server goes away, the
> > timeout is always 2 minutes plus the time I expect based on timeo and
> > retrans. In this case, 2 minutes and 6 seconds:
> >
> > $ time ls /mnt/thor04
> > ls: cannot access '/mnt/thor04': Connection timed out
> >
> > real 2m6.025s
> > user 0m0.003s
> > sys 0m0.000s
> >
> > Watching the outgoing packets in the second case, the pattern is
> > always the
> > same:
> > - 0.2 seconds between the first two, then doubling each time until
> > the two minute mark is exceeded (so the last NFS packet, which is
> > always the 11th packet, is sent around 1:45 after the first).
> > - Then some generic packets that start exactly-ish on the two minute
> > mark, 1 second between the first two, then doubling each time. (By
> > this time the NFS command has given up.)
> >
> > [tcpdump output snipped]
> >
> > I tried changing tcp_retries2 as suggested in another thread from this list:
> >
> > # echo 3 > /proc/sys/net/ipv4/tcp_retries2
> >
> > ...but it made no difference on either kernel. The 2 minute timeout
> > also doesn't seem to match with what I'd calculate from the initial
> > value of tcp_retries2, which should give a much higher timeout.
> >
> > The only clue I've been able to find is in the retry=n entry in the
> > NFS
> > manpage:
> >
> > " For TCP the default is 3 minutes, but system TCP connection timeouts
> > will sometimes limit the timeout of each retransmission to around 2
> minutes."
> >
> > What I'm not able to make sense of:
> > - The retry option says that it applies to mount operations, not
> > read/write operations. However, in this case I'm seeing the 2 minute
> > delay on read/write operations but *not* mount operations.
> > - A couple of hours of searching didn't lead me to any kernel
> > settings that would result in a 2 minute timeout.
> >
> > Does anyone have any clues about a) what's happening and b) how to get
> > our desired behaviour of being able to control both mount and
> > read/write timeouts down to a few seconds?
> >
> > Thanks.
>
> I thought that changing TCP_RTO_MAX in include/net/tcp.h from 120 to
> something smaller and recompiling the kernel would change the 2 minute
> timeout, but it had no effect. I'm going to keep poking through the kernel
> code to see if there's a knob I can turn to change the 2 minute timeout, so
> that I can at least understand where it's coming from.
>
> Any hints as to where I should be looking?

I believe I've made some progress with this today:

- Calls to rpc_create() from fs/nfs/client.c are sending an rpc_timeout struct with their args.
- rpc_create() does *not* pass the timeout on to xprt_create_transport(), which then can't pass it on to xs_setup_tcp().
- xs_setup_tcp(), having no timeout passed to it, uses xs_tcp_default_timeout instead.
- changing xs_tcp_default_timeout changes the "ls" timeout behaviour I described above.
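
For reference, here's that default as it appears in net/sunrpc/xprtsock.c on the kernels I'm testing (the values are in jiffies, so 60 seconds for both to_initval and to_maxval, with 2 retries):

static const struct rpc_timeout xs_tcp_default_timeout = {
        .to_initval = 60 * HZ,
        .to_maxval = 60 * HZ,
        .to_retries = 2,
};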

In theory all of this means that the timeout simply needs to be passed through and used instead of xs_tcp_default_timeout. I'm going to give this a try tomorrow.

Here's what I'm going to try first; I'm no C programmer, though, so any advice or corrections you might have would be appreciated.

Thanks.

Andrew

diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 0b0b9f1eed46..1350c1f489f7 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -532,6 +532,7 @@ struct rpc_clnt *rpc_create(struct rpc_create_args *args)
                 .addrlen = args->addrsize,
                 .servername = args->servername,
                 .bc_xprt = args->bc_xprt,
+                .timeout = args->timeout,
         };
         char servername[48];
         struct rpc_clnt *clnt;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index aaa5b2741b79..adc79d94b59e 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -3003,7 +3003,7 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)
         xprt->idle_timeout = XS_IDLE_DISC_TO;

         xprt->ops = &xs_tcp_ops;
-        xprt->timeout = &xs_tcp_default_timeout;
+        xprt->timeout = args->timeout;

         xprt->max_reconnect_timeout = xprt->timeout->to_maxval;
         xprt->connect_timeout = xprt->timeout->to_initval *


2023-01-27 13:33:19

by Jeffrey Layton

Subject: Re: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

On Thu, 2023-01-26 at 22:08 +0000, Andrew Klaassen wrote:
> > From: Andrew Klaassen <[email protected]>
> > Sent: Thursday, January 26, 2023 10:32 AM
> >
> > > From: Andrew Klaassen <[email protected]>
> > > Sent: Monday, January 23, 2023 11:31 AM
> > >
> > > Hello,
> > >
> > > There's a specific NFSv4 mount on a specific machine which we'd
> > > like
> > > to timeout and return an error after a few seconds if the server
> > > goes away.
> > >
> > > I've confirmed the following on two different kernels, 4.18.0-
> > > 348.12.2.el8_5.x86_64 and 6.1.7-200.fc37.x86_64.
> > >
> > > I've been able to get both autofs and the mount command to
> > > cooperate,
> > > so that the mount attempt fails after an arbitrary number of
> > > seconds.
> > > This mount command, for example, will fail after 6 seconds, as
> > > expected based on the timeo=20,retrans=2,retry=0 options:
> > >
> > > $ time sudo mount -t nfs4 -o
> > > rw,relatime,sync,vers=4.2,rsize=131072,wsize=131072,namlen=255,acr
> > > egmi
> > > n
> > >
> > =0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,proto=tcp,timeo=20,ret
> > ra
> > > n s=2,retry=0,sec=sys thor04:/mnt/thorfs04 /mnt/thor04
> > > mount.nfs4: Connection timed out
> > >
> > > real 0m6.084s
> > > user 0m0.007s
> > > sys 0m0.015s
> > >
> > > However, if the share is already mounted and the server goes away,
> > > the
> > > timeout is always 2 minutes plus the time I expect based on timeo
> > > and
> > > retrans. In this case, 2 minutes and 6 seconds:
> > >
> > > $ time ls /mnt/thor04
> > > ls: cannot access '/mnt/thor04': Connection timed out
> > >
> > > real 2m6.025s
> > > user 0m0.003s
> > > sys 0m0.000s
> > >
> > > Watching the outgoing packets in the second case, the pattern is
> > > always the
> > > same:
> > > - 0.2 seconds between the first two, then doubling each time
> > > until
> > > the two minute mark is exceeded (so the last NFS packet, which is
> > > always the 11th packet, is sent around 1:45 after the first).
> > > - Then some generic packets that start exactly-ish on the two
> > > minute
> > > mark, 1 second between the first two, then doubling each time.
> > > (By
> > > this time the NFS command has given up.)
> > >
> > > [tcpdump output snipped]
> > >
> > > I tried changing tcp_retries2 as suggested in another thread from
> > > this list:
> > >
> > > # echo 3 > /proc/sys/net/ipv4/tcp_retries2
> > >
> > > ...but it made no difference on either kernel. The 2 minute
> > > timeout
> > > also doesn't seem to match with what I'd calculate from the
> > > initial
> > > value of tcp_retries2, which should give a much higher timeout.
> > >
> > > The only clue I've been able to find is in the retry=n entry in
> > > the
> > > NFS
> > > manpage:
> > >
> > > " For TCP the default is 3 minutes, but system TCP connection
> > > timeouts
> > > will sometimes limit the timeout of each retransmission to around
> > > 2
> > minutes."
> > >
> > > What I'm not able to make sense of:
> > > - The retry option says that it applies to mount operations, not
> > > read/write operations. However, in this case I'm seeing the 2
> > > minute
> > > delay on read/write operations but *not* mount operations.
> > > - A couple of hours of searching didn't lead me to any kernel
> > > settings that would result in a 2 minute timeout.
> > >
> > > Does anyone have any clues about a) what's happening and b) how to
> > > get
> > > our desired behaviour of being able to control both mount and
> > > read/write timeouts down to a few seconds?
> > >
> > > Thanks.
> >
> > I thought that changing TCP_RTO_MAX in include/net/tcp.h from 120 to
> > something smaller and recompiling the kernel would change the 2
> > minute
> > timeout, but it had no effect. I'm going to keep poking through the
> > kernel
> > code to see if there's a knob I can turn to change the 2 minute
> > timeout, so
> > that I can at least understand where it's coming from.
> >
> > Any hints as to where I should be looking?
>
> I believe I've made some progress with this today:
>
> - Calls to rpc_create() from fs/nfs/client.c are sending an
> rpc_timeout struct with their args.
> - rpc_create() does *not* pass the timeout on to
> xprt_create_transport(), which then can't pass it on to
> xs_setup_tcp().
> - xs_setup_tcp(), having no timeout passed to it, uses
> xs_tcp_default_timeout instead.
> - changing xs_tcp_default_timeout changes the "ls" timeout behaviour
> I described above.
>
> In theory all of this means that the timeout simply needs to be passed
> through and used instead of xs_tcp_default_timeout. I'm going to give
> this a try tomorrow.
>

That's a great root-cause analysis. The interlocking timeouts involved
with NFS and its sockets can be really difficult to unwind.

Is there a way to automate this testcase? That might be nice to have in
xfstests or the nfstest suite.
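
Something like this (untested) timing harness, plus a firewall rule that drops traffic to the server, might be a starting point. The /mnt/thor04 path is just the example mount from above:

/* Time how long an uncached getattr on a dead NFS mount takes to fail.
 * Build with: gcc -o nfsgetattr-timer nfsgetattr-timer.c */
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "/mnt/thor04";
        struct timespec start, end;
        struct stat st;

        clock_gettime(CLOCK_MONOTONIC, &start);
        if (stat(path, &st) != 0)
                perror(path);
        clock_gettime(CLOCK_MONOTONIC, &end);

        printf("stat() took %.3f seconds\n",
               (end.tv_sec - start.tv_sec) +
               (end.tv_nsec - start.tv_nsec) / 1e9);
        return 0;
}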

> Here's what I'm going to try first; I'm no C programmer, though, so
> any advice or corrections you might have would be appreciated.
>
> Thanks.
>
> Andrew
>
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index 0b0b9f1eed46..1350c1f489f7 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -532,6 +532,7 @@ struct rpc_clnt *rpc_create(struct rpc_create_args *args)
>                  .addrlen = args->addrsize,
>                  .servername = args->servername,
>                  .bc_xprt = args->bc_xprt,
> +                .timeout = args->timeout,
>          };
>          char servername[48];
>          struct rpc_clnt *clnt;
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index aaa5b2741b79..adc79d94b59e 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -3003,7 +3003,7 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)
>          xprt->idle_timeout = XS_IDLE_DISC_TO;
>
>          xprt->ops = &xs_tcp_ops;
> -        xprt->timeout = &xs_tcp_default_timeout;
> +        xprt->timeout = args->timeout;
>
>          xprt->max_reconnect_timeout = xprt->timeout->to_maxval;
>          xprt->connect_timeout = xprt->timeout->to_initval *
>

Looks like you're probably on the right track. You're missing a few
things:

You'll need to add a "timeout" field to struct xprt_create in
include/linux/sunrpc/xprt.h, and there may be some other places that
either need to set the timeout in that structure, or do something with
that field when it's set.
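
To illustrate, something along these lines might do it (an untested sketch; the NULL check preserves the current default for any caller that never sets the new field):

/* include/linux/sunrpc/xprt.h */
struct xprt_create {
        ...
        const struct rpc_timeout *timeout;  /* NULL: use the transport default */
};

/* net/sunrpc/xprtsock.c, in xs_setup_tcp() */
xprt->timeout = args->timeout ? args->timeout : &xs_tcp_default_timeout;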

Once you have something that fixes your reproducer, go ahead and post it
and we can help you work through whatever changes need to be made to
make it work.

Nice work!
--
Jeff Layton <[email protected]>

2023-01-30 19:33:37

by Andrew Klaassen

Subject: RE: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

> From: Jeff Layton <[email protected]>
> Sent: Friday, January 27, 2023 8:33 AM
>
> On Thu, 2023-01-26 at 22:08 +0000, Andrew Klaassen wrote:
> > > From: Andrew Klaassen <[email protected]>
> > > Sent: Thursday, January 26, 2023 10:32 AM
> > >
> > > > From: Andrew Klaassen <[email protected]>
> > > > Sent: Monday, January 23, 2023 11:31 AM
> > > >
> > > > Hello,
> > > >
> > > > There's a specific NFSv4 mount on a specific machine which we'd
> > > > like to timeout and return an error after a few seconds if the
> > > > server goes away.
> > > >
> > > > I've confirmed the following on two different kernels, 4.18.0-
> > > > 348.12.2.el8_5.x86_64 and 6.1.7-200.fc37.x86_64.
> > > >
> > > > I've been able to get both autofs and the mount command to
> > > > cooperate, so that the mount attempt fails after an arbitrary
> > > > number of seconds.
> > > > This mount command, for example, will fail after 6 seconds, as
> > > > expected based on the timeo=20,retrans=2,retry=0 options:
> > > >
> > > > $ time sudo mount -t nfs4 -o
> > > > rw,relatime,sync,vers=4.2,rsize=131072,wsize=131072,namlen=255,acr
> > > > egmi
> > > > n
> > > >
> > > =0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,proto=tcp,timeo=20,ret
> > > ra
> > > > n s=2,retry=0,sec=sys thor04:/mnt/thorfs04 /mnt/thor04
> > > > mount.nfs4: Connection timed out
> > > >
> > > > real 0m6.084s
> > > > user 0m0.007s
> > > > sys 0m0.015s
> > > >
> > > > However, if the share is already mounted and the server goes away,
> > > > the timeout is always 2 minutes plus the time I expect based on
> > > > timeo and retrans. In this case, 2 minutes and 6 seconds:
> > > >
> > > > $ time ls /mnt/thor04
> > > > ls: cannot access '/mnt/thor04': Connection timed out
> > > >
> > > > real 2m6.025s
> > > > user 0m0.003s
> > > > sys 0m0.000s
> > > >
> > > > Watching the outgoing packets in the second case, the pattern is
> > > > always the
> > > > same:
> > > > - 0.2 seconds between the first two, then doubling each time
> > > > until the two minute mark is exceeded (so the last NFS packet,
> > > > which is always the 11th packet, is sent around 1:45 after the
> > > > first).
> > > > - Then some generic packets that start exactly-ish on the two
> > > > minute mark, 1 second between the first two, then doubling each
> > > > time.
> > > > (By
> > > > this time the NFS command has given up.)
> > > >
> > > > [tcpdump output snipped]
> > > >
> > > > I tried changing tcp_retries2 as suggested in another thread from
> > > > this list:
> > > >
> > > > # echo 3 > /proc/sys/net/ipv4/tcp_retries2
> > > >
> > > > ...but it made no difference on either kernel. The 2 minute
> > > > timeout also doesn't seem to match with what I'd calculate from
> > > > the initial value of tcp_retries2, which should give a much higher
> > > > timeout.
> > > >
> > > > The only clue I've been able to find is in the retry=n entry in
> > > > the NFS
> > > > manpage:
> > > >
> > > > " For TCP the default is 3 minutes, but system TCP connection
> > > > timeouts will sometimes limit the timeout of each retransmission
> > > > to around
> > > > 2
> > > minutes."
> > > >
> > > > What I'm not able to make sense of:
> > > > - The retry option says that it applies to mount operations, not
> > > > read/write operations. However, in this case I'm seeing the 2
> > > > minute delay on read/write operations but *not* mount operations.
> > > > - A couple of hours of searching didn't lead me to any kernel
> > > > settings that would result in a 2 minute timeout.
> > > >
> > > > Does anyone have any clues about a) what's happening and b) how to
> > > > get our desired behaviour of being able to control both mount and
> > > > read/write timeouts down to a few seconds?
> > > >
> > > > Thanks.
> > >
> > > I thought that changing TCP_RTO_MAX in include/net/tcp.h from 120 to
> > > something smaller and recompiling the kernel would change the 2
> > > minute timeout, but it had no effect. I'm going to keep poking
> > > through the kernel code to see if there's a knob I can turn to
> > > change the 2 minute timeout, so that I can at least understand where
> > > it's coming from.
> > >
> > > Any hints as to where I should be looking?
> >
> > I believe I've made some progress with this today:
> >
> > - Calls to rpc_create() from fs/nfs/client.c are sending an
> > rpc_timeout struct with their args.
> > - rpc_create() does *not* pass the timeout on to
> > xprt_create_transport(), which then can't pass it on to
> > xs_setup_tcp().
> > - xs_setup_tcp(), having no timeout passed to it, uses
> > xs_tcp_default_timeout instead.
> > - changing xs_tcp_default_timeout changes the "ls" timeout behaviour
> > I described above.
> >
> > In theory all of this means that the timeout simply needs to be passed
> > through and used instead of xs_tcp_default_timeout. I'm going to give
> > this a try tomorrow.
> >
>
> That's a great root-cause analysis. The interlocking timeouts involved with
> NFS and its sockets can be really difficult to unwind.
>
> Is there a way to automate this testcase? That might be nice to have in
> xfstests or the nfstest suite.
>
> > Here's what I'm going to try first; I'm no C programmer, though, so
> > any advice or corrections you might have would be appreciated.
> >
> > Thanks.
> >
> > Andrew
> >
> > [patch snipped]
> >
>
> Looks like you're probably on the right track. You're missing a few
> things:
>
> You'll need to add a "timeout" field to struct xprt_create in
> include/linux/sunrpc/xprt.h, and there may be some other places that either
> need to set the timeout in that structure, or do something with that field
> when it's set.
>
> Once you have something that fixes your reproducer, go ahead and post it
> and we can help you work through whatever changes need to be made to
> make it work.
>
> Nice work!

Thanks for the tip, that was helpful.
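
Here's the shape of the change I'm testing: the new field added to struct
xprt_create in include/linux/sunrpc/xprt.h. This is only a sketch (the
field list is from my 6.2-rc5 tree and the comments are trimmed), so
check the real header rather than trusting my ordering:

struct xprt_create {
	int			ident;		/* XPRT_TRANSPORT identifier */
	struct net		*net;
	struct sockaddr		*srcaddr;	/* optional local address */
	struct sockaddr		*dstaddr;	/* remote peer address */
	size_t			addrlen;
	const char		*servername;
	struct svc_xprt		*bc_xprt;	/* NFSv4.1 backchannel */
	struct rpc_xprt_switch	*bc_xps;
	unsigned int		flags;
	const struct rpc_timeout *timeout;	/* new: carried over from rpc_create_args */
};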

Currently I'm fighting with kernel recompilation. I decided to make it quicker by slimming down the config, but apparently I've achieved something which Google claims no one else has achieved:

Errors on kernel make modules_install:

DEPMOD /lib/modules/6.2.0-rc5-sunrpctimeo+
depmod: WARNING: /lib/modules/6.2.0-rc5-sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol nfs4_disable_idmapping
depmod: WARNING: /lib/modules/6.2.0-rc5-sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol nfs4_label_alloc
depmod: WARNING: /lib/modules/6.2.0-rc5-sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol send_implementation_id
depmod: WARNING: /lib/modules/6.2.0-rc5-sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol nfs_atomic_open
depmod: WARNING: /lib/modules/6.2.0-rc5-sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol nfs_clear_verifier_delegated
depmod: WARNING: /lib/modules/6.2.0-rc5-sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol nfs4_client_id_uniquifier
depmod: WARNING: /lib/modules/6.2.0-rc5-sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol nfs4_dentry_operations
depmod: WARNING: /lib/modules/6.2.0-rc5-sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol nfs_fscache_open_file
depmod: WARNING: /lib/modules/6.2.0-rc5-sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol nfs4_fs_type
depmod: WARNING: /lib/modules/6.2.0-rc5-sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol recover_lost_locks
depmod: WARNING: /lib/modules/6.2.0-rc5-sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol nfs_callback_nr_threads
depmod: WARNING: /lib/modules/6.2.0-rc5-sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol max_session_cb_slots
depmod: WARNING: /lib/modules/6.2.0-rc5-sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol max_session_slots
depmod: WARNING: /lib/modules/6.2.0-rc5-sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol nfs_idmap_cache_timeout
depmod: WARNING: /lib/modules/6.2.0-rc5-sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol nfs_callback_set_tcpport

Errors on module load:

[ 94.008271] nfsv4: Unknown symbol nfs4_disable_idmapping (err -2)
[ 94.008321] nfsv4: Unknown symbol nfs4_label_alloc (err -2)
[ 94.008434] nfsv4: Unknown symbol send_implementation_id (err -2)
[ 94.008446] nfsv4: Unknown symbol nfs_atomic_open (err -2)
[ 94.008468] nfsv4: Unknown symbol nfs_clear_verifier_delegated (err -2)
[ 94.008475] nfsv4: Unknown symbol nfs4_client_id_uniquifier (err -2)
[ 94.008501] nfsv4: Unknown symbol nfs4_dentry_operations (err -2)
[ 94.008521] nfsv4: Unknown symbol nfs_fscache_open_file (err -2)
[ 94.008566] nfsv4: Unknown symbol nfs4_fs_type (err -2)
[ 94.008595] nfsv4: Unknown symbol recover_lost_locks (err -2)
[ 94.008639] nfsv4: Unknown symbol nfs_callback_nr_threads (err -2)
[ 94.008654] nfsv4: Unknown symbol max_session_cb_slots (err -2)
[ 94.008678] nfsv4: Unknown symbol max_session_slots (err -2)
[ 94.008694] nfsv4: Unknown symbol nfs_idmap_cache_timeout (err -2)
[ 94.008709] nfsv4: Unknown symbol nfs_callback_set_tcpport (err -2)

I suspect I've turned something off in the config that I shouldn't have, but I'm not sure what. I see that one of the symbols (nfs_clear_verifier_delegated) is in include/linux/nfs_fs.h, and the others are defined in fs/nfs/nfs4_fs.h, fs/nfs/super.c, fs/nfs/dir.c, fs/nfs/inode.c, fs/nfs/fscache.c, and fs/nfs/fs_context.c. I'm changing config options and recompiling to try to figure out what I'm missing, but at a couple of hours per compile and only a couple of days a week to work on this it's slow going. Any hints as to what I might be doing wrong would be appreciated.
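
In case it helps anyone doing the same archaeology: rather than guessing,
my plan for the next pass is to diff my slimmed config against the
distro's with the script that ships in the tree (the grep pattern is just
my first guess at where the breakage lives):

$ scripts/diffconfig /boot/config-$(uname -r) .config | grep -iE 'nfs|sunrpc'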

Andrew



2023-01-30 19:55:48

by Jeffrey Layton

[permalink] [raw]
Subject: Re: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

On Mon, 2023-01-30 at 19:33 +0000, Andrew Klaassen wrote:
> [...]
>
> Thanks for the tip, that was helpful.
>
> Currently I'm fighting with kernel recompilation. I decided to make
> it quicker by slimming down the config, but apparently I've achieved
> something which Google claims no one else has achieved:
>
> [...]
>
> I suspect I've turned something off in the config that I shouldn't
> have, but I'm not sure what. I see that one of the symbols
> (nfs_clear_verifier_delegated) is in include/linux/nfs_fs.h, and the
> others are defined in fs/nfs/nfs4_fs.h, fs/nfs/super.c, fs/nfs/dir.c,
> fs/nfs/inode.c, fs/nfs/fscache.c, and fs/nfs/fs_context.c. I'm
> changing config options and recompiling to try to figure out what I'm
> missing, but at a couple of hours per compile and only a couple of
> days a week to work on this it's slow going. Any hints as to what I
> might be doing wrong would be appreciated.
>

Looks like the ABI got broken when you turned off some options.

Generally, if you just want to build a single module, then you want the
.config to be _exactly_ the one that you used to build the kernel you're
going to plug it into. Then to build the modules under fs/nfs you can
do:

make modules_prepare
make M=fs/nfs

...and then drop the resulting .ko objects into the right place in
/lib/modules.
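
Spelled out, the round trip looks something like this (a sketch; adjust
the paths and release string for your build, and note the target
directory is the one from your depmod output):

	make modules_prepare
	make M=fs/nfs
	sudo cp fs/nfs/*.ko /lib/modules/$(uname -r)/kernel/fs/nfs/
	sudo depmod -a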

That said, it may be simpler to just build and work with a whole kernel
for testing purposes. Working with an individual kmod can be a bit
tricky unless you know what you're doing.

Once you do the first, subsequent builds should be reasonably fast.
--
Jeff Layton <[email protected]>

2023-01-30 20:03:20

by Andrew Klaassen

[permalink] [raw]
Subject: RE: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

> From: Jeff Layton <[email protected]>
> Sent: Monday, January 30, 2023 2:56 PM
>
> On Mon, 2023-01-30 at 19:33 +0000, Andrew Klaassen wrote:
> > [...]
>
> Looks like the ABI got broken when you turned off some options.
>
> Generally, if you just want to build a single module, then you want the
> .config to be _exactly_ the one that you used to build the kernel you're
> going to plug it into. Then to build the modules under fs/nfs you can
> do:
>
> make modules_prepare
> make M=fs/nfs
>
> ...and then drop the resulting .ko objects into the right place in
> /lib/modules.
>
> That said, it may be simpler to just build and work with a whole kernel
> for testing purposes. Working with an individual kmod can be a bit
> tricky unless you know what you're doing.
>
> Once you do the first, subsequent builds should be reasonably fast.

I'm going to go back to a full kernel build with make oldconfig using the distro's kernel config to try to avoid this latest issue, then try what you've suggested to speed up recompiles.

Since my changes are in net/sunrpc, should I be doing something like this?

make modules_prepare
make M=net/sunrpc
make M=fs/nfs

Or do I not need to recompile nfs if I'm only touching the internals of sunrpc?

Thanks again.

Andrew



2023-01-30 20:31:51

by Jeffrey Layton

[permalink] [raw]
Subject: Re: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

On Mon, 2023-01-30 at 20:03 +0000, Andrew Klaassen wrote:
> > From: Jeff Layton <[email protected]>
> > Sent: Monday, January 30, 2023 2:56 PM
> >
> > On Mon, 2023-01-30 at 19:33 +0000, Andrew Klaassen wrote:
> > > > From: Jeff Layton <[email protected]>
> > > > Sent: Friday, January 27, 2023 8:33 AM
> > > >
> > > > On Thu, 2023-01-26 at 22:08 +0000, Andrew Klaassen wrote:
> > > > > > From: Andrew Klaassen <[email protected]>
> > > > > > Sent: Thursday, January 26, 2023 10:32 AM
> > > > > >
> > > > > > > From: Andrew Klaassen <[email protected]>
> > > > > > > Sent: Monday, January 23, 2023 11:31 AM
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > There's a specific NFSv4 mount on a specific machine which
> > > > > > > we'd like to timeout and return an error after a few
> > > > > > > seconds
> > > > > > > if the server goes away.
> > > > > > >
> > > > > > > I've confirmed the following on two different kernels,
> > > > > > > 4.18.0-
> > > > > > > 348.12.2.el8_5.x86_64 and 6.1.7-200.fc37.x86_64.
> > > > > > >
> > > > > > > I've been able to get both autofs and the mount command to
> > > > > > > cooperate, so that the mount attempt fails after an
> > > > > > > arbitrary
> > > > > > > number of seconds.
> > > > > > > This mount command, for example, will fail after 6
> > > > > > > seconds, as
> > > > > > > expected based on the timeo=20,retrans=2,retry=0 options:
> > > > > > >
> > > > > > > $ time sudo mount -t nfs4 -o
> > > > > > > rw,relatime,sync,vers=4.2,rsize=131072,wsize=131072,namlen
> > > > > > > =255
> > > > > > > ,acr
> > > > > > > egmi
> > > > > > > n
> > > > > > >
> > > > > > =0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,proto=tcp,time
> > > > > > o=20
> > > > > > ,ret
> > > > > > ra
> > > > > > > n s=2,retry=0,sec=sys thor04:/mnt/thorfs04 /mnt/thor04
> > > > > > > mount.nfs4: Connection timed out
> > > > > > >
> > > > > > > real 0m6.084s
> > > > > > > user 0m0.007s
> > > > > > > sys 0m0.015s
> > > > > > >
> > > > > > > However, if the share is already mounted and the server
> > > > > > > goes
> > > > > > > away, the timeout is always 2 minutes plus the time I
> > > > > > > expect
> > > > > > > based on timeo and retrans. In this case, 2 minutes and 6
> > > > > > > seconds:
> > > > > > >
> > > > > > > $ time ls /mnt/thor04
> > > > > > > ls: cannot access '/mnt/thor04': Connection timed out
> > > > > > >
> > > > > > > real 2m6.025s
> > > > > > > user 0m0.003s
> > > > > > > sys 0m0.000s
> > > > > > >
> > > > > > > Watching the outgoing packets in the second case, the
> > > > > > > pattern
> > > > > > > is always the
> > > > > > > same:
> > > > > > >  - 0.2 seconds between the first two, then doubling each
> > > > > > > time
> > > > > > > until the two minute mark is exceeded (so the last NFS
> > > > > > > packet,
> > > > > > > which is always the 11th packet, is sent around 1:45 after
> > > > > > > the
> > > > > > > first).
> > > > > > >  - Then some generic packets that start exactly-ish on the
> > > > > > > two
> > > > > > > minute mark, 1 second between the first two, then doubling
> > > > > > > each time.
> > > > > > > (By
> > > > > > > this time the NFS command has given up.)
> > > > > > >
> > > > > > > 11:10:21.898305 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags
> > > > > > > [P.], seq 14452:14652, ack 18561, win 501, options
> > > > > > > [nop,nop,TS
> > > > > > > val
> > > > > > > 834889483 ecr 1589769203], length 200: NFS request xid
> > > > > > > 3614904256
> > > > > > > 196 getattr fh
> > > > > > > 0,2/53
> > > > > > > 11:10:22.105189 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags
> > > > > > > [P.], seq 14452:14652, ack 18561, win 501, options
> > > > > > > [nop,nop,TS
> > > > > > > val
> > > > > > > 834889690 ecr 1589769203], length 200: NFS request xid
> > > > > > > 3614904256
> > > > > > > 196 getattr fh
> > > > > > > 0,2/53
> > > > > > > 11:10:22.313290 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags
> > > > > > > [P.], seq 14452:14652, ack 18561, win 501, options
> > > > > > > [nop,nop,TS
> > > > > > > val
> > > > > > > 834889898 ecr 1589769203], length 200: NFS request xid
> > > > > > > 3614904256
> > > > > > > 196 getattr fh
> > > > > > > 0,2/53
> > > > > > > 11:10:22.721269 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags
> > > > > > > [P.], seq 14452:14652, ack 18561, win 501, options
> > > > > > > [nop,nop,TS
> > > > > > > val
> > > > > > > 834890306 ecr 1589769203], length 200: NFS request xid
> > > > > > > 3614904256
> > > > > > > 196 getattr fh
> > > > > > > 0,2/53
> > > > > > > 11:10:23.569192 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags
> > > > > > > [P.], seq 14452:14652, ack 18561, win 501, options
> > > > > > > [nop,nop,TS
> > > > > > > val
> > > > > > > 834891154 ecr 1589769203], length 200: NFS request xid
> > > > > > > 3614904256
> > > > > > > 196 getattr fh
> > > > > > > 0,2/53
> > > > > > > 11:10:25.233212 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags
> > > > > > > [P.], seq 14452:14652, ack 18561, win 501, options
> > > > > > > [nop,nop,TS
> > > > > > > val
> > > > > > > 834892818 ecr 1589769203], length 200: NFS request xid
> > > > > > > 3614904256
> > > > > > > 196 getattr fh
> > > > > > > 0,2/53
> > > > > > > 11:10:28.497282 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags
> > > > > > > [P.], seq 14452:14652, ack 18561, win 501, options
> > > > > > > [nop,nop,TS
> > > > > > > val
> > > > > > > 834896082 ecr 1589769203], length 200: NFS request xid
> > > > > > > 3614904256
> > > > > > > 196 getattr fh
> > > > > > > 0,2/53
> > > > > > > 11:10:35.025219 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags
> > > > > > > [P.], seq 14452:14652, ack 18561, win 501, options
> > > > > > > [nop,nop,TS
> > > > > > > val
> > > > > > > 834902610 ecr 1589769203], length 200: NFS request xid
> > > > > > > 3614904256
> > > > > > > 196 getattr fh
> > > > > > > 0,2/53
> > > > > > > 11:10:48.337201 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags
> > > > > > > [P.], seq 14452:14652, ack 18561, win 501, options
> > > > > > > [nop,nop,TS
> > > > > > > val
> > > > > > > 834915922 ecr 1589769203], length 200: NFS request xid
> > > > > > > 3614904256
> > > > > > > 196 getattr fh
> > > > > > > 0,2/53
> > > > > > > 11:11:14.449303 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags
> > > > > > > [P.], seq 14452:14652, ack 18561, win 501, options
> > > > > > > [nop,nop,TS
> > > > > > > val
> > > > > > > 834942034 ecr 1589769203], length 200: NFS request xid
> > > > > > > 3614904256
> > > > > > > 196 getattr fh
> > > > > > > 0,2/53
> > > > > > > 11:12:08.721251 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags
> > > > > > > [P.], seq 14452:14652, ack 18561, win 501, options
> > > > > > > [nop,nop,TS
> > > > > > > val
> > > > > > > 834996306 ecr 1589769203], length 200: NFS request xid
> > > > > > > 3614904256
> > > > > > > 196 getattr fh
> > > > > > > 0,2/53
> > > > > > > 11:12:22.545394 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags
> > > > > > > [S], seq 1375256951, win 64240, options [mss
> > > > > > > 1460,sackOK,TS
> > > > > > > val
> > > > > > > 835010130 ecr 0,nop,wscale 7], length 0
> > > > > > > 11:12:23.570199 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags
> > > > > > > [S], seq 1375256951, win 64240, options [mss
> > > > > > > 1460,sackOK,TS
> > > > > > > val
> > > > > > > 835011155 ecr 0,nop,wscale 7], length 0
> > > > > > > 11:12:25.617284 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags
> > > > > > > [S], seq 1375256951, win 64240, options [mss
> > > > > > > 1460,sackOK,TS
> > > > > > > val
> > > > > > > 835013202 ecr 0,nop,wscale 7], length 0
> > > > > > > 11:12:29.649219 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags
> > > > > > > [S], seq 1375256951, win 64240, options [mss
> > > > > > > 1460,sackOK,TS
> > > > > > > val
> > > > > > > 835017234 ecr 0,nop,wscale 7], length 0
> > > > > > > 11:12:37.905274 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags
> > > > > > > [S], seq 1375256951, win 64240, options [mss
> > > > > > > 1460,sackOK,TS
> > > > > > > val
> > > > > > > 835025490 ecr 0,nop,wscale 7], length 0
> > > > > > > 11:12:54.289212 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags
> > > > > > > [S], seq 1375256951, win 64240, options [mss
> > > > > > > 1460,sackOK,TS
> > > > > > > val
> > > > > > > 835041874 ecr 0,nop,wscale 7], length 0
> > > > > > > 11:13:26.545304 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags
> > > > > > > [S], seq 1375256951, win 64240, options [mss
> > > > > > > 1460,sackOK,TS
> > > > > > > val
> > > > > > > 835074130 ecr 0,nop,wscale 7], length 0
> > > > > > >
> > > > > > > I tried changing tcp_retries2 as suggested in another
> > > > > > > thread
> > > > > > > from this list:
> > > > > > >
> > > > > > > # echo 3 > /proc/sys/net/ipv4/tcp_retries2
> > > > > > >
> > > > > > > ...but it made no difference on either kernel. The 2
> > > > > > > minute
> > > > > > > timeout also doesn't seem to match with what I'd calculate
> > > > > > > from the initial value of tcp_retries2, which should give
> > > > > > > a
> > > > > > > much higher timeout.
> > > > > > >
> > > > > > > The only clue I've been able to find is in the retry=n
> > > > > > > entry
> > > > > > > in the NFS
> > > > > > > manpage:
> > > > > > >
> > > > > > > " For TCP the default is 3 minutes, but system TCP
> > > > > > > connection
> > > > > > > timeouts will sometimes limit the timeout of each
> > > > > > > retransmission to around
> > > > > > > 2
> > > > > > minutes."
> > > > > > >
> > > > > > > What I'm not able to make sense of:
> > > > > > >  - The retry option says that it applies to mount
> > > > > > > operations,
> > > > > > > not read/write operations. However, in this case I'm
> > > > > > > seeing
> > > > > > > the 2 minute delay on read/write operations but *not*
> > > > > > > mount
> > > > > > > operations.
> > > > > > >  - A couple of hours of searching didn't lead me to any
> > > > > > > kernel
> > > > > > > settings that would result in a 2 minute timeout.
> > > > > > >
> > > > > > > Does anyone have any clues about a) what's happening and
> > > > > > > b)
> > > > > > > how to get our desired behaviour of being able to control
> > > > > > > both
> > > > > > > mount and read/write timeouts down to a few seconds?
> > > > > > >
> > > > > > > Thanks.
> > > > > >
> > > > > > I thought that changing TCP_RTO_MAX in include/net/tcp.h
> > > > > > from
> > > > > > 120 to
> > > > > > something smaller and recompiling the kernel would change
> > > > > > the 2
> > > > > > minute timeout, but it had no effect. I'm going to keep
> > > > > > poking
> > > > > > through the kernel code to see if there's a knob I can turn
> > > > > > to
> > > > > > change the 2 minute timeout, so that I can at least
> > > > > > understand
> > > > > > where
> > > > > > it's coming from.
> > > > > >
> > > > > > Any hints as to where I should be looking?
> > > > >
> > > > > I believe I've made some progress with this today:
> > > > >
> > > > >  - Calls to rpc_create() from fs/nfs/client.c are sending an
> > > > > rpc_timeout struct with their args.
> > > > >  - rpc_create() does *not* pass the timeout on to
> > > > > xprt_create_transport(), which then can't pass it on to
> > > > > xs_setup_tcp().
> > > > >  - xs_setup_tcp(), having no timeout passed to it, uses
> > > > > xs_tcp_default_timeout instead.
> > > > >  - changing xs_tcp_default_timeout changes the "ls" timeout
> > > > > behaviour
> > > > > I described above.
> > > > >
> > > > > In theory all of this means that the timeout simply needs to
> > > > > be
> > > > > passed
> > > > > through and used instead of xs_tcp_default_timeout. I'm going
> > > > > to
> > > > > give
> > > > > this a try tomorrow.
> > > > >
> > > >
> > > > That's a great root-cause analysis. The interlocking timeouts
> > > > involved with
> > > > NFS and its sockets can be really difficult to unwind.
> > > >
> > > > Is there a way to automate this testcase? That might be nice to
> > > > have
> > > > in
> > > > xfstests or the nfstest suite.
> > > >
> > > > > Here's what I'm going to try first; I'm no C programmer,
> > > > > though,
> > > > > so
> > > > > any advice or corrections you might have would be appreciated.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > Andrew
> > > > >
> > > > > diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c index
> > > > > 0b0b9f1eed46..1350c1f489f7 100644
> > > > > --- a/net/sunrpc/clnt.c
> > > > > +++ b/net/sunrpc/clnt.c
> > > > > @@ -532,6 +532,7 @@ struct rpc_clnt *rpc_create(struct
> > > > > rpc_create_args
> > > > > *args)
> > > > >                 .addrlen = args->addrsize,
> > > > >                 .servername = args->servername,
> > > > >                 .bc_xprt = args->bc_xprt,
> > > > > + .timeout = args->timeout,
> > > > >         };
> > > > >         char servername[48];
> > > > >         struct rpc_clnt *clnt;
> > > > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> > > > > index
> > > > > aaa5b2741b79..adc79d94b59e 100644
> > > > > --- a/net/sunrpc/xprtsock.c
> > > > > +++ b/net/sunrpc/xprtsock.c
> > > > > @@ -3003,7 +3003,7 @@ static struct rpc_xprt
> > > > > *xs_setup_tcp(struct
> > > > > xprt_create *args)
> > > > >         xprt->idle_timeout = XS_IDLE_DISC_TO;
> > > > >
> > > > >         xprt->ops = &xs_tcp_ops;
> > > > > - xprt->timeout = &xs_tcp_default_timeout;
> > > > > + xprt->timeout = args->timeout;
> > > > >
> > > > >         xprt->max_reconnect_timeout = xprt->timeout-
> > > > > >to_maxval;
> > > > >         xprt->connect_timeout = xprt->timeout->to_initval *
> > > > >
> > > >
> > > > Looks like you're probably on the right track. You're missing a
> > > > few
> > > > things:
> > > >
> > > > You'll need to add a "timeout" field to struct xprt_create in
> > > > include/linux/sunrpc/xprt.h, and there may be some other places
> > > > that
> > > > either
> > > > need to set the timeout in that structure, or do something with
> > > > that
> > > > field
> > > > when it's set.
> > > >
> > > > Once you have something that fixes your reproducer, go ahead and
> > > > post it
> > > > and we can help you work through whatever changes need to me
> > > > made
> > to
> > > > make it work.
> > > >
> > > > Nice work!
> > >
> > > Thanks for the tip, that was helpful.
> > >
> > > Currently I'm fighting with kernel recompilation. I decided to
> > > make
> > > it quicker by slimming down the config, but apparently I've
> > > achieved
> > > something which Google claims no one else has achieved:
> > >
> > > Errors on kernel make modules_install:
> > >
> > >   DEPMOD /lib/modules/6.2.0-rc5-sunrpctimeo+
> > > depmod: WARNING: /lib/modules/6.2.0-rc5-
> > > sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol
> > > nfs4_disable_idmapping
> > > depmod: WARNING: /lib/modules/6.2.0-rc5-
> > > sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol
> > > nfs4_label_alloc
> > > depmod: WARNING: /lib/modules/6.2.0-rc5-
> > > sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol
> > > send_implementation_id
> > > depmod: WARNING: /lib/modules/6.2.0-rc5-
> > > sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol
> > > nfs_atomic_open
> > > depmod: WARNING: /lib/modules/6.2.0-rc5-
> > > sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol
> > > nfs_clear_verifier_delegated
> > > depmod: WARNING: /lib/modules/6.2.0-rc5-
> > > sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol
> > > nfs4_client_id_uniquifier
> > > depmod: WARNING: /lib/modules/6.2.0-rc5-
> > > sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol
> > > nfs4_dentry_operations
> > > depmod: WARNING: /lib/modules/6.2.0-rc5-
> > > sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol
> > > nfs_fscache_open_file
> > > depmod: WARNING: /lib/modules/6.2.0-rc5-
> > > sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol
> > nfs4_fs_type
> > > depmod: WARNING: /lib/modules/6.2.0-rc5-
> > > sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol
> > > recover_lost_locks
> > > depmod: WARNING: /lib/modules/6.2.0-rc5-
> > > sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol
> > > nfs_callback_nr_threads
> > > depmod: WARNING: /lib/modules/6.2.0-rc5-
> > > sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol
> > > max_session_cb_slots
> > > depmod: WARNING: /lib/modules/6.2.0-rc5-
> > > sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol
> > > max_session_slots
> > > depmod: WARNING: /lib/modules/6.2.0-rc5-
> > > sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol
> > > nfs_idmap_cache_timeout
> > > depmod: WARNING: /lib/modules/6.2.0-rc5-
> > > sunrpctimeo+/kernel/fs/nfs/nfsv4.ko needs unknown symbol
> > > nfs_callback_set_tcpport
> > >
> > > Errors on module load:
> > >
> > > [ 94.008271] nfsv4: Unknown symbol nfs4_disable_idmapping (err -2)
> > > [ 94.008321] nfsv4: Unknown symbol nfs4_label_alloc (err -2)
> > > [ 94.008434] nfsv4: Unknown symbol send_implementation_id (err -2)
> > > [ 94.008446] nfsv4: Unknown symbol nfs_atomic_open (err -2)
> > > [ 94.008468] nfsv4: Unknown symbol nfs_clear_verifier_delegated (err -2)
> > > [ 94.008475] nfsv4: Unknown symbol nfs4_client_id_uniquifier (err -2)
> > > [ 94.008501] nfsv4: Unknown symbol nfs4_dentry_operations (err -2)
> > > [ 94.008521] nfsv4: Unknown symbol nfs_fscache_open_file (err -2)
> > > [ 94.008566] nfsv4: Unknown symbol nfs4_fs_type (err -2)
> > > [ 94.008595] nfsv4: Unknown symbol recover_lost_locks (err -2)
> > > [ 94.008639] nfsv4: Unknown symbol nfs_callback_nr_threads (err -2)
> > > [ 94.008654] nfsv4: Unknown symbol max_session_cb_slots (err -2)
> > > [ 94.008678] nfsv4: Unknown symbol max_session_slots (err -2)
> > > [ 94.008694] nfsv4: Unknown symbol nfs_idmap_cache_timeout (err -2)
> > > [ 94.008709] nfsv4: Unknown symbol nfs_callback_set_tcpport (err -2)
> > >
> > > I suspect I've turned something off in the config that I shouldn't
> > > have, but I'm not sure what. I see that one of the symbols
> > > (nfs_clear_verifier_delegated) is in include/linux/nfs_fs.h, and the
> > > others are defined in fs/nfs/nfs4_fs.h, fs/nfs/super.c, fs/nfs/dir.c,
> > > fs/nfs/inode.c, fs/nfs/fscache.c, and fs/nfs/fs_context.c. I'm
> > > changing config options and recompiling to try to figure out what I'm
> > > missing, but at a couple of hours per compile and only a couple of
> > > days a week to work on this it's slow going. Any hints as to what I
> > > might be doing wrong would be appreciated. :-)
> > >
> >
> > Looks like the ABI got broken when you turned off some options.
> >
> > Generally, if you just want to build a single module, then you want the
> > .config to be _exactly_ the one that you used to build the kernel
> > you're going to plug it into. Then to build the modules under fs/nfs
> > you can do:
> >
> >     make modules_prepare
> >     make M=fs/nfs
> >
> > ...and then drop the resulting .ko objects into the right place in
> > /lib/modules.
> >
> > That said, it may be simpler to just build and work with a whole kernel
> > for testing purposes. Working with an individual kmod can be a bit
> > tricky unless you know what you're doing.
> >
> > Once you do the first, subsequent builds should be reasonably fast.
>
> I'm going to go back to a full kernel build with make oldconfig using
> the distro's kernel config to try to avoid this latest issue, then try
> what you've suggested to speed up recompiles.
>
> Since my changes are in net/sunrpc, should I be doing something like
> this?
>
> make modules_prepare
> make M=net/sunrpc
> make M=fs/nfs
>
> Or do I not need to recompile nfs if I'm only touching the internals
> of sunrpc?
>
> Thanks again.
>
> Andrew
>
>
>

You shouldn't need to build both *UNLESS* you change the ABI. That
includes stuff like the number of arguments to an exported function, or
the size or layout of particular structures or arrays that both modules
might be working with, etc...

If you do that, then things can break in all sorts of "interesting" ways
that can be very hard to track down. Without seeing your patch, it's
hard to know whether you're breaking those rules here. YMMV, of course.

Again, doing a full kernel build is the safest way to avoid that sort of
thing. I'd counsel against shortcuts here unless you know what you're
doing. Let the machine do the work. ;)
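
For reference, the end-to-end single-module flow described above might
look roughly like this (a sketch only; it assumes the tree's .config
exactly matches the running kernel, and the install paths are
illustrative -- some distros compress modules as .ko.xz):

    make modules_prepare
    make M=net/sunrpc
    make M=fs/nfs

    # drop the fresh objects over the installed ones, then rebuild
    # the module dependency data
    sudo cp net/sunrpc/sunrpc.ko /lib/modules/$(uname -r)/kernel/net/sunrpc/
    sudo cp fs/nfs/nfs.ko fs/nfs/nfsv4.ko /lib/modules/$(uname -r)/kernel/fs/nfs/
    sudo depmod -a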
--
Jeff Layton <[email protected]>

2023-01-30 22:12:01

by Andrew J. Romero

[permalink] [raw]
Subject: Zombie / Orphan open files

Hi

This is a quick general NFS server question.

Does the NFSv4.x specification require or recommend that the NFS server, after some reasonable time,
should or must close orphan / zombie open files?

On several NAS platforms I have seen large numbers of orphan / zombie open files "pile up"
as a result of Kerberos credential expiration.

Does the Red Hat NFS server "deal with" orphan / zombie open files?

Thanks

Andy Romero
Fermilab


2023-01-31 00:10:53

by Chuck Lever

[permalink] [raw]
Subject: Re: Zombie / Orphan open files



> On Jan 30, 2023, at 5:11 PM, Andrew J. Romero <[email protected]> wrote:
>
> Hi
>
> This is a quick general NFS server question.
>
> Does the NFSv4x specification require or recommend that: the NFS server, after some reasonable time,
> should / must close orphan / zombie open files ?

No, it does not. A server is supposed to leave open state alone
if the client continues to renew its lease.

A server has some recourse, though. It can recall delegations
to free up resources. We have some patches for v6.2 that do
that.

Servers can also free state where subsequent accesses by a
client indicate that the server administrator has revoked
that state. I don't believe the spec makes any statement
about when to use this facility or how to choose state to
purge, and I'm pretty sure Linux NFSD does not implement it.

A heavyweight tool would be to simulate a server reboot to
force clients to acknowledge which state they are still
using, via state recovery.
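
On a Linux server, that heavyweight option might amount to something
like the following (a sketch; assumes a systemd-managed nfsd and
accepts the grace-period disruption for all clients):

    # force every client through state recovery
    sudo systemctl restart nfs-server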


> On several NAS platforms I have seen large numbers of orphan / zombie open files "pile up"
> as a result of Kerberos credential expiration.
>
> Does the Red Hat NFS server "deal with" orphan / zombie open files ?

Not currently, nor does the upstream server.

Purging state is not terribly good for data integrity guarantees,
and I'm not sure how the server would make fair choices about
what OPEN stateids to purge.

So before going down that path I would like to see if the file
leakage might be the result of aberrant client behavior, and
try to address the issue from that side first. Do you have a
simple reproducer for this issue? How do you observe the
orphaned files?
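
If the server in question were a Linux nfsd, recent kernels expose the
state it holds under /proc/fs/nfsd/clients; a sketch, assuming that
interface is present on your kernel:

    ls /proc/fs/nfsd/clients/           # one directory per client holding state
    cat /proc/fs/nfsd/clients/*/info    # client identity and address
    cat /proc/fs/nfsd/clients/*/states  # open/lock stateids still held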

--
Chuck Lever




2023-01-31 13:27:49

by Jeffrey Layton

[permalink] [raw]
Subject: Re: Zombie / Orphan open files

On Mon, 2023-01-30 at 22:11 +0000, Andrew J. Romero wrote:
> Hi
>
> This is a quick general NFS server question.
>
> Does the NFSv4x specification require or recommend that: the NFS server, after some reasonable time,
> should / must close orphan / zombie open files ?
>
> On several NAS platforms I have seen large numbers of orphan / zombie open files "pile up"
> as a result of Kerberos credential expiration.
>
> Does the Red Hat NFS server "deal with" orphan / zombie open files ?
>
> Thanks
>
> Andy Romero
> Fermilab
>

What do you mean by "zombie / orphan" here? Do you mean files that have
been sillyrenamed [1] to ".nfsXXXXXXX" ? Or are you simply talking about
clients that are holding files open for a long time?

--
Jeff Layton <[email protected]>

[1]: https://linux-nfs.org/wiki/index.php/Server-side_silly_rename

2023-01-31 14:42:24

by Andrew J. Romero

[permalink] [raw]
Subject: RE: Zombie / Orphan open files



> From: Jeff Layton <[email protected]>

> What do you mean by "zombie / orphan" here? Do you mean files that have
> been sillyrenamed [1] to ".nfsXXXXXXX" ? Or are you simply talking about
> clients that are holding files open for a long time?

Hi Jeff

.... clients that are holding files open for a long time

Here's a complete summary:

On my NAS appliances, I noticed that average usage of the relevant memory pool
never went down. I suspected some sort of "leak" or "file-stuck-open" scenario.

I hypothesized that if NFS-client to NFS-server communications were frequently disrupted,
this would explain the memory-pool behavior I was seeing.
I felt that Kerberos credential expiration was the most likely frequent disruptor.

I ran a simple python test script that (1) opened enough files that I could see an obvious jump
in the relevant NAS memory pool metric, then (2) went to sleep for shorter than the
Kerberos ticket lifetime, then (3) exited without explicitly closing the files.
The result: After the script exited, usage of the relevant server-side memory pool decreased by
the expected amount.

Then I ran a simple python test script that (1) opened enough files that I could see an obvious jump
in the relevant NAS memory pool metric, then (2) went to sleep for longer than the
Kerberos ticket lifetime, then (3) exited without explicitly closing the files.
The result: After the script exited, usage of the relevant server-side memory pool did not decrease.
( the files opened by the script were permanently "stuck open" ... depleting the server-side pool resource)

In a large campus environment, usage of the relevant memory pool will eventually get so
high that a server-side reboot will be needed.

I'm working with my NAS vendor ( who is very helpful ); however, if the NFS server and client specifications
don't specify an official way to handle this very real problem, there is not much a NAS server vendor can safely / responsibly do.

If there currently is no formal/official way of handling this issue ( server-side pool exhaustion due to "disappearing" client )
is this a problem worth solving ( at a level lower than the application level )?

If client applications were all well behaved ( didn't leave files open for long periods of time ) we wouldn't have a significant issue.
Assuming applications aren't going to be well behaved, are there good general ways of solving this on either the client or server side ?

Thanks

Andy



2023-01-31 15:26:28

by Jeffrey Layton

[permalink] [raw]
Subject: Re: Zombie / Orphan open files

On Tue, 2023-01-31 at 14:42 +0000, Andrew J. Romero wrote:
>
> > From: Jeff Layton <[email protected]>
>
> > What do you mean by "zombie / orphan" here? Do you mean files that have
> > been sillyrenamed [1] to ".nfsXXXXXXX" ? Or are you simply talking about
> > clients that are holding files open for a long time?
>
> Hi Jeff
>
> .... clients that are holding files open for a long time
>
> Here's a complete summary:
>
> On my NAS appliances , I noticed that average usage of the relevant memory pool
> never went down. I suspected some sort of "leak" or "file-stuck-open" scenario.
>
> I hypothesized that if NFS-client to NFS-server communications were frequently disrupted,
> this would explain the memory-pool behavior I was seeing.
> I felt that Kerberos credential expiration was the most likely frequent disruptor.
>
> I ran a simple python test script that (1) opened enough files that I could see an obvious jump
> in the relevant NAS memory pool metric, then (2) went to sleep for shorter than the
> Kerberos ticket lifetime, then (3) exited without explicitly closing the files.
> The result: After the script exited, usage of the relevant server-side memory pool decreased by
> the expected amount.
>
> Then I ran a simple python test script that (1) opened enough files that I could see an obvious jump
> in the relevant NAS memory pool metric, then (2) went to sleep for longer than the
> Kerberos ticket lifetime, then (3) exited without explicitly closing the files.
> The result: After the script exited, usage of the relevant server-side memory pool did not decrease.
> ( the files opened by the script were permanently "stuck open" ... depleting the server-side pool resource)
>
> In a large campus environment, usage of the relevant memory pool will eventually get so
> high that a server-side reboot will be needed.
>
> I'm working with my NAS vendor ( who is very helpful ); however, if the NFS server and client specifications
> don't specify an official way to handle this very real problem, there is not much a NAS server vendor can safely / responsibly do.
>
> If there currently is no formal/official way of handling this issue ( server-side pool exhaustion due to "disappearing" client )
> is this a problem worth solving ( at a level lower than the application level )?
>
> If client applications were all well behaved ( didn't leave files open for long periods of time ) we wouldn't have a significant issue.
> Assuming applications aren't going to be well behaved, are there good general ways of solving this on either the client or server side ?
>

Yeah, that's an interesting problem. From the server's standpoint, we're
just doing what the client asked. It asked for an open stateid and we
gave it one. It's up to the client to release it, as long as it keeps
renewing its lease.

Would it be wrong for the client to use its own (machine) creds for
CLOSE or OPEN_DOWNGRADE RPCs when the appropriate user creds are not
available? The client is just releasing resources, after all, which
seems like something that ought not require a specific set of creds on
the server.

--
Jeff Layton <[email protected]>

2023-01-31 15:32:03

by Chuck Lever

[permalink] [raw]
Subject: Re: Zombie / Orphan open files



> On Jan 31, 2023, at 9:42 AM, Andrew J. Romero <[email protected]> wrote:
>
>
>
>> From: Jeff Layton <[email protected]>
>
>> What do you mean by "zombie / orphan" here? Do you mean files that have
>> been sillyrenamed [1] to ".nfsXXXXXXX" ? Or are you simply talking about
>> clients that are holding files open for a long time?
>
> Hi Jeff
>
> .... clients that are holding files open for a long time
>
> Here's a complete summary:
>
> On my NAS appliances , I noticed that average usage of the relevant memory pool
> never went down. I suspected some sort of "leak" or "file-stuck-open" scenario.
>
> I hypothesized that if NFS-client to NFS-server communications were frequently disrupted,
> this would explain the memory-pool behavior I was seeing.
> I felt that Kerberos credential expiration was the most likely frequent disruptor.
>
> I ran a simple python test script that (1) opened enough files that I could see an obvious jump
> in the relevant NAS memory pool metric, then (2) went to sleep for shorter than the
> Kerberos ticket lifetime, then (3) exited without explicitly closing the files.
> The result: After the script exited, usage of the relevant server-side memory pool decreased by
> the expected amount.
>
> Then I ran a simple python test script that (1) opened enough files that I could see an obvious jump
> in the relevant NAS memory pool metric, then (2) went to sleep for longer than the
> Kerberos ticket lifetime, then (3) exited without explicitly closing the files.
> The result: After the script exited, usage of the relevant server-side memory pool did not decrease.
> ( the files opened by the script were permanently "stuck open" ... depleting the server-side pool resource)
>
> In a large campus environment, usage of the relevant memory pool will eventually get so
> high that a server-side reboot will be needed.
>
> I'm working with my NAS vendor ( who is very helpful ); however, if the NFS server and client specifications
> don't specify an official way to handle this very real problem, there is not much a NAS server vendor can safely / responsibly do.

Yes, there is: the NAS vendor can report the problem to the people
they get their server code from :-)


> If there currently is no formal/official way of handling this issue ( server-side pool exhaustion due to "disappearing" client )
> is this a problem worth solving ( at a level lower than the application level )?

Yes, this is IMO unwelcome behavior, and a real problem for large
scale deployment, as you describe above.

But let's be careful: a "disappearing client" should be handled
properly: its lease will expire and the server will eventually
close out any OPEN state that client was responsible for.

If the client continues to renew its state, and the application
doesn't quit or close its files, neither the client nor the server
can easily tell that there is a problem.

Moreover, ticket expiry is not necessarily an indication that the
application is done with a file.


> If client applications were all well behaved ( didn't leave files open for long periods of time ) we wouldn't have a significant issue.
> Assuming applications aren't going to be well behaved, are there good general ways of solving this on either the client or server side ?

The server needs to manage its resource pools appropriately,
otherwise it is exposed to DoS or DDoS attacks. That will
improve over time, but I'm not seeing an immediate way to
fairly address this on the server side. As Jeff said, the
server is just doing what clients are asking of it.

The client-side needs to clean up when it can, so we need to
explore that. Actually that might be where you have a little
more immediate control of this situation. The applications
need to either re-authenticate or close files they no longer
need. I think you'd have this problem with long-lived
applications running on one big system as well.


--
Chuck Lever




2023-01-31 16:27:11

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: Zombie / Orphan open files

On Mon, Jan 30, 2023 at 5:44 PM Andrew J. Romero <[email protected]> wrote:
>
> Hi
>
> This is a quick general NFS server question.
>
> Does the NFSv4x specification require or recommend that: the NFS server, after some reasonable time,
> should / must close orphan / zombie open files ?

Why should the server be responsible for a badly behaving client? It
seems like you are advocating for a world where a problem is hidden
rather than solved. But because bugs do occur and some customers want
a quick solution, some storage providers do have ways of dealing with
releasing resources (like open state) that the client will never ask
for again.

Why should we excuse bad user behaviour? For things like long running
jobs, users have to be educated that their credentials must stay valid
for the duration of their usage.

Why should we excuse poor application behaviour that doesn't close
files? But in a way we do: the OS will make sure that the file is
closed when the application exits without explicitly closing the
file. So I'm curious how you get into a state with zombies?

> On several NAS platforms I have seen large numbers of orphan / zombie open files "pile up"
> as a result of Kerberos credential expiration.
>
> Does the Red Hat NFS server "deal with" orphan / zombie open files ?
>
> Thanks
>
> Andy Romero
> Fermilab
>
>

2023-01-31 16:35:25

by Chuck Lever

[permalink] [raw]
Subject: Re: Zombie / Orphan open files



> On Jan 31, 2023, at 9:42 AM, Andrew J. Romero <[email protected]> wrote:
>
> In a large campus environment, usage of the relevant memory pool will eventually get so
> high that a server-side reboot will be needed.

The above is sticking with me a bit.

Rebooting the server should force clients to re-establish state.

Are they not re-establishing open file state for users whose
ticket has expired? I would think each client would re-establish
state for those open files anyway, and the server would be in the
same overcommitted state it was in before it rebooted.

We might not have an accurate root cause analysis yet, or I could
be missing something.

--
Chuck Lever




2023-01-31 16:59:31

by Andrew J. Romero

[permalink] [raw]
Subject: RE: Zombie / Orphan open files



> -----Original Message-----
> From: Chuck Lever III <[email protected]>
>
> > On Jan 31, 2023, at 9:42 AM, Andrew J. Romero <[email protected]> wrote:
> >
> > In a large campus environment, usage of the relevant memory pool will eventually get so
> > high that a server-side reboot will be needed.
>
> The above is sticking with me a bit.
>
> Rebooting the server should force clients to re-establish state.
>
> Are they not re-establishing open file state for users whose
> ticket has expired?


> I would think each client would re-establish
> state for those open files anyway, and the server would be in the
> same overcommitted state it was in before it rebooted.


When the number of opens gets close to the limit which would result in
a disruptive NFSv4 service interruption ( currently 128K open files is the limit ),
I do the reboot ( actually I transfer the affected NFS serving resource
from one NAS cluster-node to the other NAS cluster-node ... this, based on experience,
is like a 99.9% "non-disruptive reboot" of the affected NFS serving resource )

Before the resource transfer there will be ~126K open files
( from the NAS perspective )
0.1 seconds after the resource transfer there will be
close to zero files open. Within a few seconds there will
be ~2000 and within a few minutes there will be ~2100.
During the rest of the day I only see a slow rise in the average number
of opens to maybe 2200. ( my take is ~2100 files were "active opens" before and after
the resource transfer , the rest of the 126K opens were zombies
that the clients were no longer using ). In 4-6 months
the number of opens from the NAS perspective will slowly
creep back up to the limit.



>
> We might not have an accurate root cause analysis yet, or I could
> be missing something.
>
> --
> Chuck Lever
>
>


2023-01-31 17:44:40

by Andrew J. Romero

[permalink] [raw]
Subject: RE: Zombie / Orphan open files



> From: Olga Kornievskaia <[email protected]>
> > Hi
> >
> > This is a quick general NFS server question.
> >
> > Does the NFSv4x specification require or recommend that: the NFS server, after some reasonable time,
> > should / must close orphan / zombie open files ?
>
> Why should the server be responsible for a badly behaving client? It
> seems like you are advocating for the world where a problem is hidden
> rather than solved. But because bugs do occur and some customers want
> a quick solution, some storage providers do have ways of dealing with
> releasing resources (like open state) that the client will never ask
> for again.
>
> Why should we excuse bad user behaviour? For things like long running
> jobs users have to be educated that their credentials must stay valid
> for the duration of their usage.
>
> Why should we excuse poor application behaviour that doesn't close
> files? But in a way we do, the OS will make sure that the file is
> closed when the application exists without explicitly closing the


From the perspective of the system-admin of a large NFS server
that provides services to a large multi-mission user base,
making the file-service ( client and server file-service components )
tolerant of foolishness and malice ( DOS / DDOS ) at the application
layer is highly advantageous.

This is not a problem of determining where to justly place blame.
We are all fairly certain that the "culprits" are users creating / using bad applications; however,
we are 100% certain that the strategy of eliminating all bad applications ( accidental and
intentional ) is a bit like global peace ... highly desirable but not so easy to implement.


> file. So I'm curious how do you get in a state with zombie?

I think in most cases it's:
a file is open for a period of time long enough to be affected by a "disruption".

The most common "disruption" for me appears to be Kerberos ticket expiration
for interactive user sessions. ( for known long-running background / robotic tasks
I have people manage keytab-based credentials with gssproxy .. works nicely; but
not for interactive use )

2023-01-31 18:06:13

by Chuck Lever

[permalink] [raw]
Subject: Re: Zombie / Orphan open files



> On Jan 31, 2023, at 11:59 AM, Andrew J. Romero <[email protected]> wrote:
>
>
>
>> -----Original Message-----
>> From: Chuck Lever III <[email protected]>
>>
>>> On Jan 31, 2023, at 9:42 AM, Andrew J. Romero <[email protected]> wrote:
>>>
>>> In a large campus environment, usage of the relevant memory pool will eventually get so
>>> high that a server-side reboot will be needed.
>>
>> The above is sticking with me a bit.
>>
>> Rebooting the server should force clients to re-establish state.
>>
>> Are they not re-establishing open file state for users whose
>> ticket has expired?
>
>
>> I would think each client would re-establish
>> state for those open files anyway, and the server would be in the
>> same overcommitted state it was in before it rebooted.
>
>
> When the number of opens gets close to the limit which would result in
> a disruptive NFSv4 service interruption ( currently 128K open files is the limit),
> I do the reboot ( actually I transfer the affected NFS serving resource
> from one NAS cluster-node to the other NAS cluster node ... this based on experience
> is like a 99.9% "non-disruptive reboot" of the affected NFS serving resource )
>
> Before the resource transfer there will be ~126K open files
> ( from the NAS perspective )
> 0.1 seconds after the resource transfer there will be
> close to zero files open. Within a few seconds there will
> be ~2000 and within a few minutes there will be ~2100.
> During the rest of the day I only see a slow rise in the average number
> of opens to maybe 2200. ( my take is ~2100 files were "active opens" before and after
> the resource transfer , the rest of the 126K opens were zombies
> that the clients were no longer using ).

That's not the way state recovery works. Clients will reopen only
the files that are still in use. If the clients don't open the
"zombie" files again, then I'm fairly certain the applications
have already closed those files.

In other words, the server might have an internal resource leak
instead.


> In 4-6 months
> the number of opens from the NAS perspective will slowly
> creep back up to the limit.

We will need to have a better understanding of where the leaks
actually come from. You have provided one way that an open leak
can happen, but that way doesn't line up with the evidence you
have here. So I agree that something is amiss, but more analysis
is necessary.

What release of the Linux kernel is your NAS device running?


--
Chuck Lever




2023-01-31 18:13:39

by Jeffrey Layton

[permalink] [raw]
Subject: Re: Zombie / Orphan open files

On Tue, 2023-01-31 at 16:34 +0000, Chuck Lever III wrote:
>
> > On Jan 31, 2023, at 9:42 AM, Andrew J. Romero <[email protected]> wrote:
> >
> > In a large campus environment, usage of the relevant memory pool will eventually get so
> > high that a server-side reboot will be needed.
>
> The above is sticking with me a bit.
>
> Rebooting the server should force clients to re-establish state.
>
> Are they not re-establishing open file state for users whose
> ticket has expired? I would think each client would re-establish
> state for those open files anyway, and the server would be in the
> same overcommitted state it was in before it rebooted.
>
> We might not have an accurate root cause analysis yet, or I could
> be missing something.
>

My assumption was that the client wasn't able to get credentials to run
the CLOSE RPC in this case, so it can't properly send the call. That's a
big assumption though. It'd be good to confirm this.

It looks like the CLOSE codepath on the client calls nfs4_state_protect
with NFS_SP4_MACH_CRED_CLEANUP, and that should make it use the machine
cred? I'm not 100% clear here though...it looks like that may be
conditional on what was sent by the server in EXCHANGE_ID.

FWIW, I don't see any reason we shouldn't use the machine cred for the
close compound. Nothing we do in there should require permission
checking.

BTW: is this NFSv4.0 or v4.1+ (or a mix)?
--
Jeff Layton <[email protected]>

2023-01-31 18:24:06

by Frank Filz

[permalink] [raw]
Subject: RE: Zombie / Orphan open files

> On Mon, Jan 30, 2023 at 5:44 PM Andrew J. Romero <[email protected]> wrote:
> >
> > Hi
> >
> > This is a quick general NFS server question.
> >
> > Does the NFSv4x specification require or recommend that: the NFS server,
> after some reasonable time,
> > should / must close orphan / zombie open files ?
>
> Why should the server be responsible for a badly behaving client? It seems like
> you are advocating for the world where a problem is hidden rather than solved.
> But because bugs do occur and some customers want a quick solution, some
> storage providers do have ways of dealing with releasing resources (like open
> state) that the client will never ask for again.
>
> Why should we excuse bad user behaviour? For things like long running jobs
> users have to be educated that their credentials must stay valid for the duration
> of their usage.
>
> Why should we excuse poor application behaviour that doesn't close files? But in
> a way we do, the OS will make sure that the file is closed when the application
> exists without explicitly closing the file. So I'm curious how do you get in a state
> with zombie?

Don't automatically assume this is bad application behavior; though it may be behavior we don't all like, sometimes it may be for a reason. Applications may be keeping a file open to protect the file (works best when share deny modes are available, i.e. most likely a Windows client). Also, won't an executable be kept open for the lifetime of the process, especially if the executable is large enough that it will be paged in/out from the file? This assures the same executable is available for the lifetime of the process even if deleted and replaced with a new version.

Now whether this kind of activity is desirable via NFS may be another question...

Frank


2023-01-31 18:35:04

by Andrew J. Romero

[permalink] [raw]
Subject: RE: Zombie / Orphan open files



>
> That's not the way state recovery works. Clients will reopen only
> the files that are still in use. If the clients don't open the
> "zombie" files again, then I'm fairly certain the applications
> have already closed those files.

Hi

In the case of my test script, I know that the files were not
closed explicitly or on script termination ( the script terminated
without credentials ). By the time my session re-acquired credentials
( intentionally after process termination ), the process was already
terminated and nothing, on the client, would ever attempt to clean up
the server-side "zombie open files".

The server-side pool usage caused by my intentionally
bad test script was not freed up until I did the cluster resource migration.

Question:
When a simple app ( for example a python script ) on the NFS client
simply opens a text file, is a lease automatically, behind the scenes,
created on the server? If so, is the server responsible for doing this:
if the lease isn't renewed every N minutes, close the file?

By "simply opens" a text file, I mean that: the script contains no
code to request or in any way explicitly use locks



Thanks



2023-01-31 18:52:12

by Chuck Lever

[permalink] [raw]
Subject: Re: Zombie / Orphan open files


> On Jan 31, 2023, at 1:33 PM, Andrew J. Romero <[email protected]> wrote:
>
>> That's not the way state recovery works. Clients will reopen only
>> the files that are still in use. If the clients don't open the
>> "zombie" files again, then I'm fairly certain the applications
>> have already closed those files.
>
> Hi
>
> In the case of my test script , I know that the files were not
> closed explicitly or on script termination. ( script terminated
> without credentials ) . By the time my session re-acquired credentials
> ( intentionally after process termination) , the process was already terminated
> and nothing, on the client, would ever attempt to clean-up the
> server-side "zombie open files"
>
> The server-side pool usage caused by my intentionally
> bad test script was not freed up until I did the cluster resource migration.
>
> Question:
> When a simple app (for example a python script ) on the NFS client
> simply opens a text file, is a lease automatically, behind the scenes,
> created on the server. If so, is the server responsible for doing this:
> If the lease isn't renewed every N minutes, close the file.

Almost. The protocol requires:

After the client reboots, when it opens its first file, the client
does a SETCLIENTID or EXCHANGE_ID to establish its lease on the
server. All OPEN and LOCK state is managed under the umbrella of
that lease (and that includes all files that client is managing).
The client keeps the lease alive by renewing the lease every minute.

If the client reboots (ie, does a subsequent SETCLIENTID or
EXCHANGE_ID with a new boot verifier), the server has to purge all
open file state for that client.

If the client fails to renew its lease, the server is free to do
what it wants -- it can purge the client's lease immediately, or
it can wait until conflicting opens or locks come from other
clients and then purge some or all of that client's lease.

If the client can't or doesn't CLOSE that file, it will remain
on the server until the client tells it (implicitly by not
renewing or explicitly with a fresh ID) that the state is no
longer needed; or until the server reboots and the client does
not re-establish the OPEN state.

Therefore, rebooting individual clients that have accrued these
zombie files should also clear them out without interrupting the
file service for everyone else.

But again, we need some way to confirm exactly how this is
happening. Can you post your script, or capture client-server
network traffic while the script does its thing?
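
For example, something like this on the client for the duration of the
script run ( interface name and server address are placeholders ):

    sudo tcpdump -i eth0 -s 0 -w nfs-zombie.pcap host nas.example.com and port 2049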


> By "simply opens" a text file, I mean that: the script contains no
> code to request or in any way explicitly use locks


--
Chuck Lever




2023-01-31 19:08:48

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: Zombie / Orphan open files

On Tue, Jan 31, 2023 at 1:35 PM Andrew J. Romero <[email protected]> wrote:
>
>
>
> >
> > That's not the way state recovery works. Clients will reopen only
> > the files that are still in use. If the clients don't open the
> > "zombie" files again, then I'm fairly certain the applications
> > have already closed those files.
>
> Hi
>
> In the case of my test script , I know that the files were not
> closed explicitly or on script termination.

How do you know that the files were not closed on the script
termination? One way to see what the OS might be doing for you is to
grab either a set of tracepoints or a network trace. A client would
have sent the close but it was for some reason rejected by the server?
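
For the tracepoint side, something along these lines on the client
would show whether a CLOSE is attempted when the process exits ( event
names vary by kernel version, so treat these as examples ):

    # record NFSv4 open/close state activity while reproducing
    sudo trace-cmd record -e nfs4:nfs4_open_file -e nfs4:nfs4_close sleep 60
    trace-cmd report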

> ( script terminated
> without credentials ) . By the time my session re-acquired credentials
> ( intentionally after process termination) , the process was already terminated
> and nothing, on the client, would ever attempt to clean-up the
> server-side "zombie open files"
>
> The server-side pool usage caused by my intentionally
> bad test script was not freed up until I did the cluster resource migration.

Once you did a migration event (which is how storage can recover from
having unrecoverable state, btw), if the client (i.e., the kernel, not
the script) "truly" didn't close the files, then the kernel would have
recovered the open state again. However, I suspect that a resource
migration event helps to get out of a bad state. Which means the
client (i.e., the kernel) did try to close the file but failed to do so
(lack of creds, as you say), and since the kernel won't try to recover
from the lack of creds forever, it might give up on doing the close.
Yet, on the server side that state would remain. And something like a
migration event (which is non-disruptive to the client) is a way to
get out of such situations.

> Question:
> When a simple app (for example a python script ) on the NFS client
> simply opens a text file, is a lease automatically, behind the scenes,
> created on the server. If so, is the server responsible for doing this:
> If the lease isn't renewed every N minutes, close the file.
>
> By "simply opens" a text file, I mean that: the script contains no
> code to request or in any way explicitly use locks
>
>
>
> Thanks
>
>

2023-01-31 19:20:11

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: Zombie / Orphan open files

On Tue, Jan 31, 2023 at 1:19 PM Frank Filz <[email protected]> wrote:
>
> > On Mon, Jan 30, 2023 at 5:44 PM Andrew J. Romero <[email protected]> wrote:
> > >
> > > Hi
> > >
> > > This is a quick general NFS server question.
> > >
> > > Does the NFSv4x specification require or recommend that: the NFS server,
> > after some reasonable time,
> > > should / must close orphan / zombie open files ?
> >
> > Why should the server be responsible for a badly behaving client? It seems like
> > you are advocating for the world where a problem is hidden rather than solved.
> > But because bugs do occur and some customers want a quick solution, some
> > storage providers do have ways of dealing with releasing resources (like open
> > state) that the client will never ask for again.
> >
> > Why should we excuse bad user behaviour? For things like long running jobs
> > users have to be educated that their credentials must stay valid for the duration
> > of their usage.
> >
> > Why should we excuse poor application behaviour that doesn't close files? But in
> > a way we do, the OS will make sure that the file is closed when the application
> > exists without explicitly closing the file. So I'm curious how do you get in a state
> > with zombie?
>
> Don't automatically assume this is bad application behavior, though it may be behavior we don't all like, sometimes it may be for a reason. Applications may be keeping a file open to protect the file (works best when share deny modes are available, i.e. most likely a Windows client). Also, won't an executable be kept open for the lifetime of the process, especially if the executable is large enough that it will be paged in/out from the file? This assures the same executable is available for the lifetime of the process even if deleted and replaced with a new version.

Aren't you describing a long running job (a file that needs to be
kept opened -- and not closed -- for a long period of time)? And it's
a user's responsibility to have creds that last long enough (or a
system of renewal) to cover the duration of the job. To be clear,
you are talking about a long running process that keeps a file opened.
You are not talking about a process that starts, opens a file, and the
process exits without closing the file. That's the poor application
behaviour I was referring to. Regardless, in that situation the OS
cleans up. So I'm very curious how these zombie/orphan files are being
created, and how it happens that the OS doesn't clean up.

> Now whether this kind of activity is desirable via NFS may be another question...
>
> Frank
>

2023-01-31 19:31:42

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: Zombie / Orphan open files

On Tue, Jan 31, 2023 at 12:12 PM Andrew J. Romero <[email protected]> wrote:
>
>
>
> > -----Original Message-----
> > From: Chuck Lever III <[email protected]>
> >
> > > On Jan 31, 2023, at 9:42 AM, Andrew J. Romero <[email protected]> wrote:
> > >
> > > In a large campus environment, usage of the relevant memory pool will eventually get so
> > > high that a server-side reboot will be needed.
> >
> > The above is sticking with me a bit.
> >
> > Rebooting the server should force clients to re-establish state.
> >
> > Are they not re-establishing open file state for users whose
> > ticket has expired?
>
>
> > I would think each client would re-establish
> > state for those open files anyway, and the server would be in the
> > same overcommitted state it was in before it rebooted.
>
>
> When the number of opens gets close to the limit which would result in
> a disruptive NFSv4 service interruption ( currently 128K open files is the limit),
> I do the reboot ( actually I transfer the affected NFS serving resource
> from one NAS cluster-node to the other NAS cluster node ... this based on experience
> is like a 99.9% "non-disruptive reboot" of the affected NFS serving resource )
>
> Before the resource transfer there will be ~126K open files
> ( from the NAS perspective )
> 0.1 seconds after the resource transfer there will be
> close to zero files open. Within a few seconds there will
> be ~2000 and within a few minutes there will be ~2100.
> During the rest of the day I only see a slow rise in the average number
> of opens to maybe 2200. ( my take is ~2100 files were "active opens" before and after
> the resource transfer , the rest of the 126K opens were zombies
> that the clients were no longer using ). In 4-6 months
> the number of opens from the NAS perspective will slowly
> creep back up to the limit.

What you are describing sounds like a bug in a system (be it client or
server). There is state that the client thought it closed but that the
server is still keeping.

>
>
>
> >
> > We might not have an accurate root cause analysis yet, or I could
> > be missing something.
> >
> > --
> > Chuck Lever
> >
> >
>

2023-01-31 19:32:48

by Andrew J. Romero

[permalink] [raw]
Subject: RE: Zombie / Orphan open files



> -----Original Message-----
> From: Chuck Lever III <[email protected]>
>
> Almost. The protocol requires:
>
> After the client reboots, when it opens its first file, the client
> does a SETCLIENTID or EXCHANGE_ID to establish its lease on the
> server. All OPEN and LOCK state is managed under the umbrella of
> that lease (and that includes all files that client is managing).
> The client keeps the lease alive by renewing the lease every minute.
>
> If the client reboots (ie, does a subsequent SETCLIENTID or
> EXCHANGE_ID with a new boot verifier), the server has to purge all
> open file state for that client.
>
> If the client fails to renew its lease, the server is free to do
> what it wants -- it can purge the client's lease immediately, or
> it can wait until conflicting opens or locks come from other
> clients and then purge some or all of that client's lease.
>
> If the client can't or doesn't CLOSE that file, it will remain
> on the server until the client tells it (implicitly by not
> renewing or explicitly with a fresh ID) that the state is no
> longer needed; or until the server reboots and the client does
> not re-establish the OPEN state.

So, in general, this is true:
- A lease is not "issued" for every file opened
- A lease is not "issued" for every user running on an NFS-client host
- In general, one lease is issued / managed for each NFS-client host
( if this is true, my server vendor is probably not forgetting to do
something they should be doing )


> But again, we need some way to confirm exactly how this is
> happening. Can you post your script, or capture client-server
> network traffic while the script does its thing?
>

The script is about as simple as "hello world":

import sys
import fileinput
import os.path
import re
import time

def main():

    StartID = int(raw_input("Enter Start ID: "))

    TestDir = os.path.normcase('/nashome/r/romero/stuckopentest/dataout')

    FPlist = []

    # open 2000 files and leave them open
    for x in range(StartID, StartID+2000):

        TestFilePath = os.path.join(TestDir, "TestFile-" + str(x))
        print(TestFilePath)

        # open file, append file pointer to list
        FPlist.append(open(TestFilePath, "w"))

    # sleep for greater than Krb ticket life time
    # 2000 files will be "stuck open" on the server
    time.sleep(60*60*24)


main()


NOTE:

I don't expect people on this list to debug my issue.

My reasons for posting:

- Determine if my NAS vendor might be accidentally
not doing something they should be.
( I now don't really think this is the case. )

- Determine if this is a known behavior common to all NFS implementations
( Linux, ....etc ) and if so have you determine if this is a problem that should be addressed
in the spec and the implementations.

2023-01-31 19:55:11

by Andrew J. Romero

[permalink] [raw]
Subject: RE: Zombie / Orphan open files



> What you are describing sounds like a bug in a system (be it client or
> server). There is state that the client thought it closed but the
> server still keeping that state.

Hi Olga

Based on my simple test script experiment,
here's a summary of what I believe is happening:

1. An interactive user starts a process that opens a file or multiple files.

2. A disruption that prevents
NFS-client <-> NFS-server communication
occurs while the file is open. This could be due to
having the file open a long time or due to opening the file
too close to the time of disruption.

( I believe the most common "disruption" is
credential expiration )

3. The user's process terminates before the disruption
is cleared. ( or, stated another way, the disruption is not cleared until after the user
process terminates )

At the time the user process terminates, the process
cannot tell the server to close the server-side file state.

After the process terminates, nothing will ever tell the server
to close the files. The now-zombie open files will continue to
consume server-side resources.

In environments with many users, the problem is significant.

My reasons for posting are not to have your team help troubleshoot my
specific issue ( that would be quite rude ); they are:

- Determine if my NAS vendor might be accidentally
not doing something they should be.
( I now don't really think this is the case. )

- Determine if this is a known behavior common to all NFS implementations
( Linux, ....etc ) and if so have your team determine if this is a problem that should be addressed
in the spec and the implementations.



Andy





2023-01-31 21:31:47

by Frank Filz

[permalink] [raw]
Subject: RE: Zombie / Orphan open files

> On Tue, Jan 31, 2023 at 1:19 PM Frank Filz <[email protected]> wrote:
> >
> > > On Mon, Jan 30, 2023 at 5:44 PM Andrew J. Romero <[email protected]>
> wrote:
> > > >
> > > > Hi
> > > >
> > > > This is a quick general NFS server question.
> > > >
> > > > Does the NFSv4x specification require or recommend that: the NFS
> server,
> > > after some reasonable time,
> > > > should / must close orphan / zombie open files ?
> > >
> > > Why should the server be responsible for a badly behaving client? It
> > > seems like you are advocating for the world where a problem is hidden
> rather than solved.
> > > But because bugs do occur and some customers want a quick solution,
> > > some storage providers do have ways of dealing with releasing
> > > resources (like open
> > > state) that the client will never ask for again.
> > >
> > > Why should we excuse bad user behaviour? For things like long
> > > running jobs users have to be educated that their credentials must
> > > stay valid for the duration of their usage.
> > >
> > > Why should we excuse poor application behaviour that doesn't close
> > > files? But in a way we do, the OS will make sure that the file is
> > > closed when the application exists without explicitly closing the
> > > file. So I'm curious how do you get in a state with zombie?
> >
> > Don't automatically assume this is bad application behavior, though it may be
> behavior we don't all like, sometimes it may be for a reason. Applications may
> be keeping a file open to protect the file (works best when share deny modes
> are available, i.e. most likely a Windows client). Also, won't an executable be
> kept open for the lifetime of the process, especially if the executable is large
> enough that it will be paged in/out from the file? This assures the same
> executable is available for the lifetime of the process even if deleted and
> replaced with a new version.
>
> Aren't you describing is a long running job (a file that needs to be kept opened --
> and not closed -- for a long period of time)? And it's a user's responsibility to
> have creds that are long enough (or a system of renewal) that covers the
> duration of the job. To be clear you are talking about a long running process that
> keeps a file opened.
> You are not talking about a process that starts, opens a file and the process exits
> without closing a file. That's poor application behaviour I was referring too.
> Regardless in that situation OS cleans up. So I'm very curious how these
> zombie/orphan files are being created, how does it happens that the OS doesn't
> clean up.

Oh, OK, I see now I was confused; I see Andrew responded with a theory of what might be happening.

And yea, if the client is allowing credentials to expire while a file is still open, that's a problem.

Frank


2023-01-31 21:46:27

by Andrew J. Romero

[permalink] [raw]
Subject: RE: Zombie / Orphan open files




> And yea, if the client is allowing credentials to expire while a file is still open, that's a problem.

Since users acquiring credentials, the credentials getting put into the kernel context, and the
user opening files are all asynchronous activities, it's likely that some of the time the issue
will not be due to bad user behavior ( leaving files open for long periods of time ); it will
sometimes be simply a file being opened too close to credentials going away.


>
> Frank

2023-01-31 22:14:20

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: Zombie / Orphan open files

On Tue, Jan 31, 2023 at 2:55 PM Andrew J. Romero <[email protected]> wrote:
>
>
>
> > What you are describing sounds like a bug in a system (be it client or
> > server). There is state that the client thought it closed but the
> > server still keeping that state.
>
> Hi Olga
>
> Based on my simple test script experiment,
> Here's a summary of what I believe is happening
>
> 1. An interactive user starts a process that opens a file or multiple files
>
> 2. A disruption, that prevents
> NFS-client <-> NFS-server communication,
> occurs while the file is open. This could be due to
> having the file open a long time or due to opening the file
> too close to the time of disruption.
>
> ( I believe the most common "disruption" is
> credential expiration )
>
> 3) The user's process terminates before the disruption
> is cleared. ( or stated another way , the disruption is not cleared until after the user
> process terminates )
>
> At the time the user process terminates, the process
> can not tell the server to close the server-side file state.
>
> After the process terminates, nothing will ever tell the server
> to close the files. The now zombie open files will continue to
> consume server-side resources.
>
> In environments with many users, the problem is significant
>
> My reasons for posting:
>
> - Are not to have your team help troubleshoot my specific issue
> ( that would be quite rude )
>
> they are:
>
> - Determine If my NAS vendor might be accidentally
> not doing something they should be.
> ( I now don't really think this is the case. )

It's hard to say who's at fault here without having some more info
like tracepoints or network traces.

> - Determine if this is a known behavior common to all NFS implementations
> ( Linux, ....etc ) and if so have your team determine if this is a problem that should be addressed
> in the spec and the implementations.

What you describe -- having different views of state on the client
and server -- is not a known common behaviour.

I tried it on my Kerberos setup.
I got a 5min ticket.
As a user, I opened a file in a process that went to sleep.
My user credentials expired (after 5 mins). I verified that by
doing an "ls" on a mounted filesystem, which resulted in a permission
denied error.
Then I killed the application that had the opened file. This resulted
in an NFS CLOSE being sent to the server using the machine's gss
context (which is the default behaviour of the linux client regardless
of whether or not the user's credentials are valid).

Basically as far as I can tell, a linux client can handle cleaning up
state when user's credentials have expired.
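
A rough shell version of that repro ( the path, principal, and timings
below are placeholders, not the exact test ):

    kinit -l 5m testuser
    python -c 'import time; f = open("/mnt/krbnfs/testfile", "w"); time.sleep(600)' &
    sleep 360            # outlive the 5-minute ticket
    ls /mnt/krbnfs       # expect: permission denied
    kill %1              # the client should still send CLOSE with the machine cred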
>
>
>
> Andy
>
>
>
>
>

2023-01-31 22:26:49

by Andrew J. Romero

[permalink] [raw]
Subject: RE: Zombie / Orphan open files

Hi Olga

This is great info !

Can you make sure that the host principal is not granted any
read or write access ( via ACL entry, owner, group or Everyone access )
to the actual directory and file being opened?

If, by spec or well-established convention, the client host principal just needs to submit the "close request"
to the NFS server, but needs no access to the actual directory tree or files, then
my NAS vendor will need to take action.

If I need to grant directory / file access to all host principals on-site,
I will probably get serious computer-security opposition.

Thanks !

Andy

>
> What you describe --- having different views of state on the client
> and server -- is not a known common behaviour.
>
> I have tried it on my Kerberos setup.
> Gotten a 5min ticket.
> As a user opened a file in a process that went to sleep.
> My user credentials have expired (after 5mins). I verified that by
> doing an "ls" on a mounted filesystem which resulted in permission
> denied error.
> Then I killed the application that had an opened file. This resulted
> in a NFS CLOSE being sent to the server using the machine's gss
> context (which is a default behaviour of the linux client regardless
> of whether or not user's credentials are valid).
>
> Basically as far as I can tell, a linux client can handle cleaning up
> state when user's credentials have expired.
> >
> >
> >
> > Andy
> >
> >
> >
> >
> >

2023-01-31 22:28:28

by Jeffrey Layton

[permalink] [raw]
Subject: Re: Zombie / Orphan open files

On Tue, 2023-01-31 at 17:14 -0500, Olga Kornievskaia wrote:
> On Tue, Jan 31, 2023 at 2:55 PM Andrew J. Romero <[email protected]> wrote:
> >
> >
> >
> > > What you are describing sounds like a bug in a system (be it client or
> > > server). There is state that the client thought it closed but the
> > > server still keeping that state.
> >
> > Hi Olga
> >
> > Based on my simple test script experiment,
> > Here's a summary of what I believe is happening
> >
> > 1. An interactive user starts a process that opens a file or multiple files
> >
> > 2. A disruption, that prevents
> > NFS-client <-> NFS-server communication,
> > occurs while the file is open. This could be due to
> > having the file open a long time or due to opening the file
> > too close to the time of disruption.
> >
> > ( I believe the most common "disruption" is
> > credential expiration )
> >
> > 3) The user's process terminates before the disruption
> > is cleared. ( or stated another way , the disruption is not cleared until after the user
> > process terminates )
> >
> > At the time the user process terminates, the process
> > can not tell the server to close the server-side file state.
> >
> > After the process terminates, nothing will ever tell the server
> > to close the files. The now zombie open files will continue to
> > consume server-side resources.
> >
> > In environments with many users, the problem is significant
> >
> > My reasons for posting:
> >
> > - Are not to have your team help troubleshoot my specific issue
> > ( that would be quite rude )
> >
> > they are:
> >
> > - Determine If my NAS vendor might be accidentally
> > not doing something they should be.
> > ( I now don't really think this is the case. )
>
> It's hard to say who's at fault here without having some more info
> like tracepoints or network traces.
>
> > - Determine if this is a known behavior common to all NFS implementations
> > ( Linux, ....etc ) and if so have your team determine if this is a problem that should be addressed
> > in the spec and the implementations.
>
> What you describe --- having different views of state on the client
> and server -- is not a known common behaviour.
>
> I have tried it on my Kerberos setup.
> Gotten a 5min ticket.
> As a user opened a file in a process that went to sleep.
> My user credentials have expired (after 5mins). I verified that by
> doing an "ls" on a mounted filesystem which resulted in permission
> denied error.
> Then I killed the application that had an opened file. This resulted
> in a NFS CLOSE being sent to the server using the machine's gss
> context (which is a default behaviour of the linux client regardless
> of whether or not user's credentials are valid).
>
> Basically as far as I can tell, a linux client can handle cleaning up
> state when user's credentials have expired.
> >

That's pretty much what I expected from looking at the code. I think
this is done via the call to nfs4_state_protect. That calls:

    if (test_bit(sp4_mode, &clp->cl_sp4_flags)) {
            msg->rpc_cred = rpc_machine_cred();
            ...
    }

Could it be that cl_sp4_flags doesn't have NFS_SP4_MACH_CRED_CLEANUP set
on his clients? AFAICT, that comes from the server. It also looks like
cl_sp4_flags may not get set on an NFSv4.0 mount.

Olga, can you test that with a v4.0 mount?
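
A v4.0 mount for that test might look like this ( server, export, and
security flavor are placeholders ):

    sudo mount -t nfs4 -o vers=4.0,sec=krb5 server.example.com:/export /mnt/test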
--
Jeff Layton <[email protected]>

2023-01-31 22:47:49

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: Zombie / Orphan open files

On Tue, Jan 31, 2023 at 5:26 PM Andrew J. Romero <[email protected]> wrote:
>
> Hi Olga
>
> This is great info !
>
> Can you make sure that the host principal is not granted any
> read or write access ( via ACL entry, owner, group or Everyone access)
> to the actual directory and file being opened.
>
> If, by spec or well established convention, the client host principal just needs to submit the "close request"
> to the NFS server ; but, needs no access to the actual directory tree or files, then
> my NAS vendor will need to take action.
>
> If I need to grant directory / file access to all host principals on-site
> I will probably get serious computer-security opposition.

Closing a file has nothing to do with having access to the file. As
per spec, doing state operations should be allowed by the machine
principal.

Here's the paragraph of the spec stating that things like CLOSE must be allowed:

In cases where the server's security policies on a portion of its
namespace require RPCSEC_GSS authentication, a client may have to use
an RPCSEC_GSS credential to remove per-file state (e.g., LOCKU, CLOSE,
etc.). The server may require that the principal that removes the
state match certain criteria (e.g., the principal might have to be the
same as the one that acquired the state). However, the client might
not have an RPCSEC_GSS context for such a principal, and might not be
able to create such a context (perhaps because the user has logged
off). When the client establishes SP4_MACH_CRED or SP4_SSV protection,
it can specify a list of operations that the server MUST allow using
the machine credential (if SP4_MACH_CRED is used) or the SSV
credential (if SP4_SSV is used).

If the NAS vendor is disallowing it then they are in the wrong.

>
> Thanks !
>
> Andy

2023-01-31 23:08:36

by Andrew J. Romero

[permalink] [raw]
Subject: RE: Zombie / Orphan open files

Hi Olga

Based on Jeff's post:

Are there some NFS client-side flags that need to be set by
the sysadmins to have the state operations performed
by the machine credential?

Are there any server-side requirements that must be fulfilled
so that the correct behavior is negotiated between client and server?

What versions of the client (RHEL 7, 8, ...) support this behavior
(state ops performed by the machine credential)?

What versions of NFS (4.0, 4.1, ...) support / mandate this behavior?

Thanks again.

If any of you plan on visiting Illinois soon, I owe you lunch!

Andy



2023-02-01 14:28:47

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: Zombie / Orphan open files

Hi Jeff,

There doesn't need to be anything done by the administrators (not for
the linux implementation). The negotiation is specified in the spec.
In the EXCHANGE_ID the client has an eia_state_protect field which
it sets to SP4_MACH_CRED and provides two lists, spo_must_enforce and
spo_must_allow (here's the spec):

"The second list is spo_must_allow and consists of those operations
the client wants to have the option of sending with the machine
credential or the SSV-based credential, even if the object the
operations are performed on is not owned by the machine or SSV
credential.

The corresponding result, also called spo_must_allow, consists of the
operations the server will allow the client to use SP4_SSV or
SP4_MACH_CRED credentials with. Normally, the server's result equals
the client's argument, but the result MAY be different.

The purpose of spo_must_allow is to allow clients to solve the
following conundrum. Suppose the client ID is confirmed with
EXCHGID4_FLAG_BIND_PRINC_STATEID, and it calls OPEN with the
RPCSEC_GSS credentials of a normal user. Now suppose the user's
credentials expire, and cannot be renewed (e.g., a Kerberos ticket
granting ticket expires, and the user has logged off and will not be
acquiring a new ticket granting ticket). The client will be unable to
send CLOSE without the user's credentials, which is to say the client
has to either leave the state on the server or re-send EXCHANGE_ID
with a new verifier to clear all state, that is, unless the client
includes CLOSE on the list of operations in spo_must_allow and the
server agrees."
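
To make that concrete: the client's side of that negotiation looks
roughly like the sketch below, using the RFC 5661 XDR names. The
opbit_set() helper is made up for illustration; this is not the Linux
client's actual code:

/* sketch: ask the server to allow CLOSE and LOCKU under the machine
 * credential, even for state created with a user's RPCSEC_GSS cred */
struct state_protect4_a sp = { .spa_how = SP4_MACH_CRED };

opbit_set(&sp.spa_mach_ops.spo_must_allow, OP_CLOSE);
opbit_set(&sp.spa_mach_ops.spo_must_allow, OP_LOCKU);

exchange_id_args.eia_state_protect = sp;
/* in the server's reply, spo_must_allow must still include OP_CLOSE
 * for the expired-credential cleanup to work */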

It's possible that the NAS storage didn't allow the CLOSE to be done
with the machine creds, and thus without user creds the state would be
left open on the server. I suggest you capture a network trace during
a mount and check the content of the reply.
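
Something like this is enough to catch it; the interface, server name,
and export below are placeholders:

$ sudo tcpdump -i eth0 -s 0 -w /tmp/mount.pcap host nas.example.com and port 2049 &
$ sudo mount -t nfs4 -o vers=4.1,sec=krb5 nas.example.com:/export /mnt/test

Then open /tmp/mount.pcap in wireshark, find the EXCHANGE_ID reply, and
check whether the server's spo_must_allow result includes CLOSE.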


2023-02-02 00:53:21

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: Zombie / Orphan open files

Adding back the mailing list.

On Wed, Feb 1, 2023 at 7:22 PM Andrew J. Romero <[email protected]> wrote:
>
>
>
> > > Is the default behavior of using MACHINE CREDENTIALS
> > > for certain state-control operations in place when NFS v4.0 is
> > > being used, or only when 4.1 or greater is being used?
> >
> > This is only for 4.1+.
> >
>
> Our NAS is v4.0 .... they will release their 4.1 soon
>

Ah, I see; not sure why I didn't ask on the thread what NFS version it
was and right away jumped to talking about 4.1+. With 4.0, operations
such as open/close would be done with the user's credentials. If they
expire, that's a problem. The 4.1 protocol is preferred over 4.0 for a
variety of reasons, and this could be one of them.

I can confirm that when I repeat the same experiment I did with 4.1
against 4.0, the client does not send a CLOSE to the server (since it
has no creds to use).
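
For anyone who wants to reproduce it, the recipe is roughly this (the
mount point, file, and ticket lifetime are placeholders):

$ kinit -l 5m someuser                  # short-lived ticket
$ sleep 100000 < /krb5mnt/dir/file &    # process pins the file open
  ... wait for the ticket to expire ...
$ ls /krb5mnt/dir                       # fails: permission denied
$ kill %1                               # 4.1: CLOSE goes out under the
                                        # machine cred; 4.0: no CLOSE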

I don't believe there are any provisions in the 4.0 spec for allowing
something like a CLOSE to be sent with anything other than the creds
of the principal that created that state.

2023-02-02 18:16:34

by Andrew Klaassen

[permalink] [raw]
Subject: RE: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

> From: Jeff Layton <[email protected]>
> Sent: Monday, January 30, 2023 3:32 PM
>
> On Mon, 2023-01-30 at 20:03 +0000, Andrew Klaassen wrote:
> > [snip: the quoted history here repeats the original problem report
> > and tcpdump traces, the root-cause analysis of xs_tcp_default_timeout,
> > the first draft of the patch, and the module-build errors, all of
> > which appear earlier in the thread]
> >
> > I'm going to go back to a full kernel build with make oldconfig using
> > the distro's kernel config to try to avoid this latest issue, then try
> > what you've suggested to speed up recompiles.
> >
> > Since my changes are in net/sunrpc, should I be doing something like
> > this?
> >
> > make modules_prepare
> > make M=net/sunrpc
> > make M=fs/nfs
> >
> > Or do I not need to recompile nfs if I'm only touching the internals
> > of sunrpc?
> >
> > Thanks again.
> >
> > Andrew
> >
> >
> >
>
> You shouldn't need to build both *UNLESS* you change the ABI. That
> includes stuff like number of arguments to an exported function, or the
> size or layout of particular structures or arrays that both modules
> might be working with, etc...
>
> If you do that, then things can break in all sorts of "interesting" ways
> that can be very hard to track down. Without seeing your patch, it's
> hard to know whether you're breaking those rules here. YMMV, of course.
>
> Again, doing a full kernel build is the safest way to avoid that sort of
> thing. I'd counsel against shortcuts here unless you know what you're
> doing. Let the machine do the work. ;)

In the end I had to do a fresh compile with a different EXTRAVERSION to get rid of the "Unknown symbol" errors. I'm guessing it has something to do with leftovers in /boot or /lib/modules from an old build, but I didn't pursue it.
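
For the record, the rebuild boiled down to roughly this (the version
suffix is arbitrary, and I'm reciting the steps from memory):

$ sed -i 's/^EXTRAVERSION =.*/EXTRAVERSION = -sunrpctimeo2/' Makefile
$ make olddefconfig
$ make -j$(nproc)
$ sudo make modules_install install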

After a few tries, I finally got a version that didn't put random values in the timeout or make the kernel blow up. Now the compiler gives me a warning about "assignment discards ‘const’ qualifier from pointer target type", but it works. I'd be happy to get advice on how to make it better. (Keeping in mind that I'm far from a C programmer. :-) )

Here's the patch so far:

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index b9f59aabee53..5b7fdaff9267 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -333,6 +333,7 @@ struct xprt_create {
struct svc_xprt *bc_xprt; /* NFSv4.1 backchannel */
struct rpc_xprt_switch *bc_xps;
unsigned int flags;
+ struct rpc_timeout *timeout; /* timeout parms */
};

struct xprt_class {
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 0b0b9f1eed46..1350c1f489f7 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -532,6 +532,7 @@ struct rpc_clnt *rpc_create(struct rpc_create_args *args)
.addrlen = args->addrsize,
.servername = args->servername,
.bc_xprt = args->bc_xprt,
+ .timeout = args->timeout,
};
char servername[48];
struct rpc_clnt *clnt;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index aaa5b2741b79..0011596be4ad 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2174,6 +2174,17 @@ static void xs_tcp_set_socket_timeouts(struct rpc_xprt *xprt,
unsigned int keepcnt;
unsigned int timeo;

+ dprintk("xs_tcp_set_socket_timeouts: xprt->timeout->to_initval: %lu\n",
+ xprt->timeout->to_initval);
+ dprintk("xs_tcp_set_socket_timeouts: xprt->timeout->to_maxval: %lu\n",
+ xprt->timeout->to_maxval);
+ dprintk("xs_tcp_set_socket_timeouts: xprt->timeout->to_increment: %lu\n",
+ xprt->timeout->to_increment);
+ dprintk("xs_tcp_set_socket_timeouts: xprt->timeout->to_retries: %u\n",
+ xprt->timeout->to_retries);
+ dprintk("xs_tcp_set_socket_timeouts: xprt->timeout->to_exponential: %u\n",
+ xprt->timeout->to_exponential);
+
spin_lock(&xprt->transport_lock);
keepidle = DIV_ROUND_UP(xprt->timeout->to_initval, HZ);
keepcnt = xprt->timeout->to_retries + 1;
@@ -3003,7 +3014,24 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)
xprt->idle_timeout = XS_IDLE_DISC_TO;

xprt->ops = &xs_tcp_ops;
- xprt->timeout = &xs_tcp_default_timeout;
+
+ if (args->timeout) {
+ xprt->timeout = kmalloc(sizeof(struct rpc_timeout), GFP_KERNEL);
+ memcpy(xprt->timeout, args->timeout, sizeof(struct rpc_timeout));
+ } else {
+ xprt->timeout = &xs_tcp_default_timeout;
+ }
+
+ dprintk("xs_setup_tcp: xprt->timeout->to_initval: %lu\n",
+ xprt->timeout->to_initval);
+ dprintk("xs_setup_tcp: xprt->timeout->to_maxval: %lu\n",
+ xprt->timeout->to_maxval);
+ dprintk("xs_setup_tcp: xprt->timeout->to_increment: %lu\n",
+ xprt->timeout->to_increment);
+ dprintk("xs_setup_tcp: xprt->timeout->to_retries: %u\n",
+ xprt->timeout->to_retries);
+ dprintk("xs_setup_tcp: xprt->timeout->to_exponential: %u\n",
+ xprt->timeout->to_exponential);
xprt->max_reconnect_timeout = xprt->timeout->to_maxval;
xprt->connect_timeout = xprt->timeout->to_initval *

2023-02-06 15:28:00

by Andrew Klaassen

[permalink] [raw]
Subject: RE: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

> From: Andrew Klaassen <[email protected]>
> Sent: Thursday, February 2, 2023 1:16 PM
>
> [snip: quoted thread history and the EXTRAVERSION rebuild notes,
> repeated in full from earlier messages in this thread]
>
> After a few tries, I finally got a version that didn't put random values in the
> timeout or make the kernel blow up. Now the compiler gives me a warning
> about "assignment discards ‘const’ qualifier from pointer target type", but it
> works. I'd be happy to get advice on how to make it better. (Keeping in
> mind that I'm far from a C programmer. :-) )
>
> [the first-draft patch is snipped here; it appears in full in my
> previous message]

I have eliminated the warnings about discarding const. However, doing so has raised a couple of questions that I hope someone can answer:

- I'm allocating memory. I assume that means I should free it somewhere. But where? In xprt_destroy(), which appears to do cleanup? Or in xprt_destroy_cb(), which is called from xprt_destroy() and which frees xprt->servername? Or somewhere else completely?
- If I free the allocated memory, will that cause any problems in the cases where no timeout is passed in via the args and the static const struct xs_tcp_default_timeout is assigned to xprt->timeout?
- If freeing the static const struct default will cause a problem, what should I do instead? Allocate and memcpy even when assigning the default? And would that mean doing the same thing for all the other transports that are setting timeouts (local, udp, tcp, and bc_tcp)?
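
To illustrate the worry behind that last question with a made-up snippet (this is not code from the patch):

/* If the default case assigns the shared static struct... */
xprt->timeout = &xs_tcp_default_timeout;
/* ...then an unconditional kfree() at teardown would be freeing
 * static read-only data, which I assume would be a bug: */
kfree(xprt->timeout);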

Thanks to anyone who has advice.

Latest diff below. Thanks.

Andrew


diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index b9f59aabee53..0b30db910af3 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -333,6 +333,7 @@ struct xprt_create {
struct svc_xprt *bc_xprt; /* NFSv4.1 backchannel */
struct rpc_xprt_switch *bc_xps;
unsigned int flags;
+ const struct rpc_timeout *timeout; /* timeout parms */
};

struct xprt_class {
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 0b0b9f1eed46..1350c1f489f7 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -532,6 +532,7 @@ struct rpc_clnt *rpc_create(struct rpc_create_args *args)
.addrlen = args->addrsize,
.servername = args->servername,
.bc_xprt = args->bc_xprt,
+ .timeout = args->timeout,
};
char servername[48];
struct rpc_clnt *clnt;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index aaa5b2741b79..457b9e0d72ba 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2983,6 +2983,7 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)
struct rpc_xprt *xprt;
struct sock_xprt *transport;
struct rpc_xprt *ret;
+ struct rpc_timeout *timeout;
unsigned int max_slot_table_size = xprt_max_tcp_slot_table_entries;

if (args->flags & XPRT_CREATE_INFINITE_SLOTS)
@@ -3003,7 +3004,14 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)
xprt->idle_timeout = XS_IDLE_DISC_TO;

xprt->ops = &xs_tcp_ops;
- xprt->timeout = &xs_tcp_default_timeout;
+
+ if (args->timeout) {
+ timeout = kmalloc(sizeof(struct rpc_timeout), GFP_KERNEL);
+ memcpy(timeout, args->timeout, sizeof(struct rpc_timeout));
+ xprt->timeout = timeout;
+ } else {
+ xprt->timeout = &xs_tcp_default_timeout;
+ }

xprt->max_reconnect_timeout = xprt->timeout->to_maxval;
xprt->connect_timeout = xprt->timeout->to_initval *





2023-02-06 17:19:02

by Andrew Klaassen

[permalink] [raw]
Subject: RE: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

> From: Andrew Klaassen <[email protected]>
> Sent: Monday, February 6, 2023 10:28 AM
>

> [snipping for readability; hope that's okay]
>
> - I'm allocating memory. I assume that means I should free it somewhere.
> But where? In xprt_destroy(), which appears to do cleanup? Or in
> xprt_destroy_cb(), which is called from xprt_destroy() and which frees
> xprt->servername? Or somewhere else completely?
> - If I free the allocated memory, will that cause any problems in the cases
> where no timeout is passed in via the args and the static const struct
> xs_tcp_default_timeout is assigned to xprt->timeout?
> - If freeing the static const struct default will cause a problem, what should I
> do instead? Allocate and memcpy even when assigning the default? And
> would that mean doing the same thing for all the other transports that are
> setting timeouts (local, udp, tcp, and bc_tcp)?

Here's my best guess at the answers to my questions. Any advice or feedback appreciated.

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index b9f59aabee53..4543ec07cc12 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -333,6 +333,7 @@ struct xprt_create {
struct svc_xprt *bc_xprt; /* NFSv4.1 backchannel */
struct rpc_xprt_switch *bc_xps;
unsigned int flags;
+ const struct rpc_timeout *timeout; /* timeout parms */
};

struct xprt_class {
@@ -373,6 +374,8 @@ void xprt_release_xprt_cong(struct rpc_xprt *xprt, struct rpc_task *task);
void xprt_release(struct rpc_task *task);
struct rpc_xprt * xprt_get(struct rpc_xprt *xprt);
void xprt_put(struct rpc_xprt *xprt);
+struct rpc_timeout * xprt_alloc_timeout(const struct rpc_timeout * timeo,
+ const struct rpc_timeout *default_timeo);
struct rpc_xprt * xprt_alloc(struct net *net, size_t size,
unsigned int num_prealloc,
unsigned int max_req);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 0b0b9f1eed46..1350c1f489f7 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -532,6 +532,7 @@ struct rpc_clnt *rpc_create(struct rpc_create_args *args)
.addrlen = args->addrsize,
.servername = args->servername,
.bc_xprt = args->bc_xprt,
+ .timeout = args->timeout,
};
char servername[48];
struct rpc_clnt *clnt;
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index ab453ede54f0..1065b76ddff4 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1801,6 +1801,26 @@ static void xprt_free_id(struct rpc_xprt *xprt)
ida_free(&rpc_xprt_ids, xprt->id);
}

+struct rpc_timeout *xprt_alloc_timeout(const struct rpc_timeout *timeo,
+ const struct rpc_timeout *default_timeo)
+{
+ struct rpc_timeout *timeout;
+ timeout = kzalloc(sizeof(struct rpc_timeout), GFP_KERNEL);
+ if (timeout == NULL)
+ return ERR_PTR(-ENOMEM);
+ if (timeo)
+ memcpy(timeout, timeo, sizeof(struct rpc_timeout));
+ else
+ memcpy(timeout, default_timeo, sizeof(struct rpc_timeout));
+ return timeout;
+}
+
+static void xprt_free_timeout(struct rpc_xprt *xprt)
+{
+ if (xprt->timeout != NULL)
+ kfree(xprt->timeout);
+}
+
struct rpc_xprt *xprt_alloc(struct net *net, size_t size,
unsigned int num_prealloc,
unsigned int max_alloc)
@@ -1837,6 +1857,7 @@ EXPORT_SYMBOL_GPL(xprt_alloc);

void xprt_free(struct rpc_xprt *xprt)
{
+ xprt_free_timeout(xprt);
put_net_track(xprt->xprt_net, &xprt->ns_tracker);
xprt_free_all_slots(xprt);
xprt_free_id(xprt);
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index aaa5b2741b79..ba05258509fa 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -3003,7 +3003,11 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)
xprt->idle_timeout = XS_IDLE_DISC_TO;

xprt->ops = &xs_tcp_ops;
- xprt->timeout = &xs_tcp_default_timeout;
+
+ xprt->timeout = xprt_alloc_timeout(args->timeout, &xs_tcp_default_timeout);
+
+ if (IS_ERR(xprt->timeout))
+ goto out_err;

xprt->max_reconnect_timeout = xprt->timeout->to_maxval;
xprt->connect_timeout = xprt->timeout->to_initval *

2023-02-27 14:48:22

by Andrew Klaassen

[permalink] [raw]
Subject: RE: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

> From: Andrew Klaassen <[email protected]>
> Sent: Monday, February 6, 2023 12:19 PM
>
> > From: Andrew Klaassen <[email protected]>
> > Sent: Monday, February 6, 2023 10:28 AM
> >
>
> > [snipping for readability; hope that's okay]
> >
> > - I'm allocating memory. I assume that means I should free it somewhere.
> > But where? In xprt_destroy(), which appears to do cleanup? Or in
> > xprt_destroy_cb(), which is called from xprt_destroy() and which frees
> > xprt-
> > >servername? Or somewhere else completely?
> > - If I free the allocated memory, will that cause any problems in the
> > cases where no timeout is passed in via the args and the static const
> > struct xs_tcp_default_timeout is assigned to xprt->timeout?
> > - If freeing the static const struct default will cause a problem,
> > what should I do instead? Allocate and memcpy even when assigning the
> > default? And would that mean doing the same thing for all the other
> > transports that are setting timeouts (local, udp, tcp, and bc_tcp)?
>
> [snipping more]

Here's the patch in what I hope is its final form. I'm planning to test it on a couple of hundred nodes over the next month or two.

Since I'm completely new to this, what would be the chances of actually getting this patch in the kernel?

Thanks.

Andrew

From caa3308a3bcf39eb95d9b59e63bd96361e98305e Mon Sep 17 00:00:00 2001
From: Andrew Klaassen <[email protected]>
Date: Fri, 10 Feb 2023 10:37:57 -0500
Subject: [PATCH] Sun RPC: Use passed-in timeouts if available instead of
always using defaults.

---
include/linux/sunrpc/xprt.h | 3 +++
net/sunrpc/clnt.c | 1 +
net/sunrpc/xprt.c | 21 +++++++++++++++++++++
net/sunrpc/xprtsock.c | 22 +++++++++++++++++++---
4 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index b9f59aabee53..ca7be090cf83 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -333,6 +333,7 @@ struct xprt_create {
struct svc_xprt *bc_xprt; /* NFSv4.1 backchannel */
struct rpc_xprt_switch *bc_xps;
unsigned int flags;
+ const struct rpc_timeout *timeout; /* timeout parms */
};

struct xprt_class {
@@ -373,6 +374,8 @@ void xprt_release_xprt_cong(struct rpc_xprt *xprt, struct rpc_task *task);
void xprt_release(struct rpc_task *task);
struct rpc_xprt * xprt_get(struct rpc_xprt *xprt);
void xprt_put(struct rpc_xprt *xprt);
+struct rpc_timeout *xprt_alloc_timeout(const struct rpc_timeout *timeo,
+ const struct rpc_timeout *default_timeo);
struct rpc_xprt * xprt_alloc(struct net *net, size_t size,
unsigned int num_prealloc,
unsigned int max_req);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 0b0b9f1eed46..1350c1f489f7 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -532,6 +532,7 @@ struct rpc_clnt *rpc_create(struct rpc_create_args *args)
.addrlen = args->addrsize,
.servername = args->servername,
.bc_xprt = args->bc_xprt,
+ .timeout = args->timeout,
};
char servername[48];
struct rpc_clnt *clnt;
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index ab453ede54f0..0bb800c90976 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1801,6 +1801,26 @@ static void xprt_free_id(struct rpc_xprt *xprt)
ida_free(&rpc_xprt_ids, xprt->id);
}

+struct rpc_timeout *xprt_alloc_timeout(const struct rpc_timeout *timeo,
+ const struct rpc_timeout *default_timeo)
+{
+ struct rpc_timeout *timeout;
+
+ timeout = kzalloc(sizeof(*timeout), GFP_KERNEL);
+ if (!timeout)
+ return ERR_PTR(-ENOMEM);
+ if (timeo)
+ memcpy(timeout, timeo, sizeof(struct rpc_timeout));
+ else
+ memcpy(timeout, default_timeo, sizeof(struct rpc_timeout));
+ return timeout;
+}
+
+static void xprt_free_timeout(struct rpc_xprt *xprt)
+{
+ kfree(xprt->timeout);
+}
+
struct rpc_xprt *xprt_alloc(struct net *net, size_t size,
unsigned int num_prealloc,
unsigned int max_alloc)
@@ -1837,6 +1857,7 @@ EXPORT_SYMBOL_GPL(xprt_alloc);

void xprt_free(struct rpc_xprt *xprt)
{
+ xprt_free_timeout(xprt);
put_net_track(xprt->xprt_net, &xprt->ns_tracker);
xprt_free_all_slots(xprt);
xprt_free_id(xprt);
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index aaa5b2741b79..13703f8e0ef1 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2924,7 +2924,12 @@ static struct rpc_xprt *xs_setup_udp(struct xprt_create *args)

xprt->ops = &xs_udp_ops;

- xprt->timeout = &xs_udp_default_timeout;
+ xprt->timeout = xprt_alloc_timeout(args->timeout, &xs_udp_default_timeout);
+ if (IS_ERR(xprt->timeout))
+ {
+ ret = ERR_CAST(xprt->timeout);
+ goto out_err;
+ }

INIT_WORK(&transport->recv_worker, xs_udp_data_receive_workfn);
INIT_WORK(&transport->error_worker, xs_error_handle);
@@ -3003,7 +3008,13 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)
xprt->idle_timeout = XS_IDLE_DISC_TO;

xprt->ops = &xs_tcp_ops;
- xprt->timeout = &xs_tcp_default_timeout;
+
+ xprt->timeout = xprt_alloc_timeout(args->timeout, &xs_tcp_default_timeout);
+ if (IS_ERR(xprt->timeout))
+ {
+ ret = ERR_CAST(xprt->timeout);
+ goto out_err;
+ }

xprt->max_reconnect_timeout = xprt->timeout->to_maxval;
xprt->connect_timeout = xprt->timeout->to_initval *
@@ -3071,7 +3082,12 @@ static struct rpc_xprt *xs_setup_bc_tcp(struct xprt_create *args)
xprt->prot = IPPROTO_TCP;
xprt->xprt_class = &xs_bc_tcp_transport;
xprt->max_payload = RPC_MAX_FRAGMENT_SIZE;
- xprt->timeout = &xs_tcp_default_timeout;
+ xprt->timeout = xprt_alloc_timeout(args->timeout, &xs_tcp_default_timeout);
+ if (IS_ERR(xprt->timeout))
+ {
+ ret = ERR_CAST(xprt->timeout);
+ goto out_err;
+ }

/* backchannel */
xprt_set_bound(xprt);
--
2.39.1

2023-02-28 13:23:56

by Jeffrey Layton

[permalink] [raw]
Subject: Re: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

On Mon, 2023-02-27 at 14:48 +0000, Andrew Klaassen wrote:
> > From: Andrew Klaassen <[email protected]>
> > Sent: Monday, February 6, 2023 12:19 PM
> >
> > > From: Andrew Klaassen <[email protected]>
> > > Sent: Monday, February 6, 2023 10:28 AM
> > >
> >
> > > [snipping for readability; hope that's okay]
> > >
> > > - I'm allocating memory. I assume that means I should free it
> > > somewhere. But where? In xprt_destroy(), which appears to do
> > > cleanup? Or in xprt_destroy_cb(), which is called from
> > > xprt_destroy() and which frees xprt->servername? Or somewhere
> > > else completely?
> > > - If I free the allocated memory, will that cause any problems in
> > > the cases where no timeout is passed in via the args and the
> > > static const struct xs_tcp_default_timeout is assigned to
> > > xprt->timeout?
> > > - If freeing the static const struct default will cause a
> > > problem, what should I do instead? Allocate and memcpy even when
> > > assigning the default? And would that mean doing the same thing
> > > for all the other transports that are setting timeouts (local,
> > > udp, tcp, and bc_tcp)?
> >
> > [snipping more]
>
> Here's the patch in what I hope is its final form. I'm planning to
> test it on a couple of hundred nodes over the next month or two.
>
> Since I'm completely new to this, what would be the chances of
> actually getting this patch in the kernel?
>
> Thanks.
>
> Andrew
>

Excellent work! I'll be interested to hear how the testing goes!


This patch still needs a bit of work. I'd consider this a proof-of-concept. You are at least demonstrating the problem with this patch (and a possible solution).

Conceptually, it's not 100% clear to me that we want the exact same timeout on the RPC call and the xprt. We might, but interlocking timeouts can produce emergent behavioral changes, and I haven't thought those through.


> From caa3308a3bcf39eb95d9b59e63bd96361e98305e Mon Sep 17 00:00:00 2001
> From: Andrew Klaassen <[email protected]>
> Date: Fri, 10 Feb 2023 10:37:57 -0500
> Subject: [PATCH] Sun RPC: Use passed-in timeouts if available instead
> of always using defaults.
>

This needs a real patch description. Describe the problem you were
having, and how this patch changes things to address it. Make sure you
add a Signed-off-by line too.

When you resend, send it to the NFS client maintainers (Trond and
Anna) using git-format-patch and git-send-email, and cc the linux-nfs list.
I think your MUA might have mangled the patch a bit. Please look over
Documentation/process/submitting-patches.rst in the kernel source tree
too.
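
Something like this should do it (recipient addresses are illustrative;
pull the real ones from the MAINTAINERS file):

$ git format-patch -1
$ git send-email --to="<maintainer addresses from MAINTAINERS>" \
      --cc=linux-nfs@vger.kernel.org 0001-*.patch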


> ---
> include/linux/sunrpc/xprt.h | 3 +++
> net/sunrpc/clnt.c | 1 +
> net/sunrpc/xprt.c | 21 +++++++++++++++++++++
> net/sunrpc/xprtsock.c | 22 +++++++++++++++++++---
> 4 files changed, 44 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> index b9f59aabee53..ca7be090cf83 100644
> --- a/include/linux/sunrpc/xprt.h
> +++ b/include/linux/sunrpc/xprt.h
> @@ -333,6 +333,7 @@ struct xprt_create {
> struct svc_xprt *bc_xprt; /* NFSv4.1 backchannel */
> struct rpc_xprt_switch *bc_xps;
> unsigned int flags;
> + const struct rpc_timeout *timeout; /* timeout parms */
> };
>
> struct xprt_class {
> @@ -373,6 +374,8 @@ void xprt_release_xprt_cong(struct rpc_xprt *xprt, struct rpc_task *task);
> void xprt_release(struct rpc_task *task);
> struct rpc_xprt * xprt_get(struct rpc_xprt *xprt);
> void xprt_put(struct rpc_xprt *xprt);
> +struct rpc_timeout *xprt_alloc_timeout(const struct rpc_timeout *timeo,
> + const struct rpc_timeout *default_timeo);
> struct rpc_xprt * xprt_alloc(struct net *net, size_t size,
> unsigned int num_prealloc,
> unsigned int max_req);
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index 0b0b9f1eed46..1350c1f489f7 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -532,6 +532,7 @@ struct rpc_clnt *rpc_create(struct rpc_create_args *args)
> .addrlen = args->addrsize,
> .servername = args->servername,
> .bc_xprt = args->bc_xprt,
> + .timeout = args->timeout,
> };
> char servername[48];
> struct rpc_clnt *clnt;
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index ab453ede54f0..0bb800c90976 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -1801,6 +1801,26 @@ static void xprt_free_id(struct rpc_xprt *xprt)
> ida_free(&rpc_xprt_ids, xprt->id);
> }
>
> +struct rpc_timeout *xprt_alloc_timeout(const struct rpc_timeout *timeo,
> + const struct rpc_timeout *default_timeo)
> +{
> + struct rpc_timeout *timeout;
> +
> + timeout = kzalloc(sizeof(*timeout), GFP_KERNEL);
> + if (!timeout)
> + return ERR_PTR(-ENOMEM);
> + if (timeo)
> + memcpy(timeout, timeo, sizeof(struct rpc_timeout));
> + else
> + memcpy(timeout, default_timeo, sizeof(struct rpc_timeout));

I don't think you need an allocation here. struct rpc_timeout is quite
small and it only contains a bunch of integers. I think it'd be better
to just embed this in struct rpc_xprt instead.

> + return timeout;
> +}
> +
> +static void xprt_free_timeout(struct rpc_xprt *xprt)
> +{
> + kfree(xprt->timeout);
> +}
> +
> struct rpc_xprt *xprt_alloc(struct net *net, size_t size,
> unsigned int num_prealloc,
> unsigned int max_alloc)
> @@ -1837,6 +1857,7 @@ EXPORT_SYMBOL_GPL(xprt_alloc);
>
> void xprt_free(struct rpc_xprt *xprt)
> {
> + xprt_free_timeout(xprt);
> put_net_track(xprt->xprt_net, &xprt->ns_tracker);
> xprt_free_all_slots(xprt);
> xprt_free_id(xprt);
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index aaa5b2741b79..13703f8e0ef1 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -2924,7 +2924,12 @@ static struct rpc_xprt *xs_setup_udp(struct xprt_create *args)
>
> xprt->ops = &xs_udp_ops;
>
> - xprt->timeout = &xs_udp_default_timeout;
> + xprt->timeout = xprt_alloc_timeout(args->timeout, &xs_udp_default_timeout);
> + if (IS_ERR(xprt->timeout))
> + {

Kernel coding style puts the opening brace on the same line as the "if"
statement. You should run your next iteration through checkpatch.pl.
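
That is, something like:

	if (IS_ERR(xprt->timeout)) {
		ret = ERR_CAST(xprt->timeout);
		goto out_err;
	}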


> + ret = ERR_CAST(xprt->timeout);
> + goto out_err;
> + }
>
> INIT_WORK(&transport->recv_worker, xs_udp_data_receive_workfn);
> INIT_WORK(&transport->error_worker, xs_error_handle);
> @@ -3003,7 +3008,13 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)
> xprt->idle_timeout = XS_IDLE_DISC_TO;
>
> xprt->ops = &xs_tcp_ops;
> - xprt->timeout = &xs_tcp_default_timeout;
> +
> + xprt->timeout = xprt_alloc_timeout(args->timeout, &xs_tcp_default_timeout);
> + if (IS_ERR(xprt->timeout))
> + {
> + ret = ERR_CAST(xprt->timeout);
> + goto out_err;
> + }
>
> xprt->max_reconnect_timeout = xprt->timeout->to_maxval;
> xprt->connect_timeout = xprt->timeout->to_initval *
> @@ -3071,7 +3082,12 @@ static struct rpc_xprt *xs_setup_bc_tcp(struct xprt_create *args)
> xprt->prot = IPPROTO_TCP;
> xprt->xprt_class = &xs_bc_tcp_transport;
> xprt->max_payload = RPC_MAX_FRAGMENT_SIZE;
> - xprt->timeout = &xs_tcp_default_timeout;
> + xprt->timeout = xprt_alloc_timeout(args->timeout, &xs_tcp_default_timeout);
> + if (IS_ERR(xprt->timeout))
> + {
> + ret = ERR_CAST(xprt->timeout);
> + goto out_err;
> + }
>
> /* backchannel */
> xprt_set_bound(xprt);
> --
> 2.39.1
>

--
Jeff Layton <[email protected]>

2023-03-02 15:25:53

by Andrew Klaassen

[permalink] [raw]
Subject: RE: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

> From: Jeff Layton <[email protected]>
> Sent: Tuesday, February 28, 2023 8:24 AM
>
> On Mon, 2023-02-27 at 14:48 +0000, Andrew Klaassen wrote:
> > > From: Andrew Klaassen <[email protected]>
> > > Sent: Monday, February 6, 2023 12:19 PM
> > >
> > > > From: Andrew Klaassen <[email protected]>
> > > > Sent: Monday, February 6, 2023 10:28 AM
> > > >
> > >
> > > > [snipping for readability; hope that's okay]
> > > >
> > > > - I'm allocating memory. I assume that means I should free it
> > > > somewhere.
> > > > But where? In xprt_destroy(), which appears to do cleanup? Or in
> > > > xprt_destroy_cb(), which is called from xprt_destroy() and which
> > > > frees xprt->servername? Or somewhere else completely?
> > > > - If I free the allocated memory, will that cause any problems in
> > > > the cases where no timeout is passed in via the args and the
> > > > static const struct xs_tcp_default_timeout is assigned to
> > > > xprt->timeout?
> > > > - If freeing the static const struct default will cause a
> > > > problem, what should I do instead? Allocate and memcpy even when
> > > > assigning the default? And would that mean doing the same thing
> > > > for all the other transports that are setting timeouts (local,
> > > > udp, tcp, and bc_tcp)?
> > >
> > > [snipping more]
> >
> > Here's the patch in what I hope is its final form. I'm planning to
> > test it on a couple of hundred nodes over the next month or two.
> >
> > Since I'm completely new to this, what would be the chances of
> > actually getting this patch in the kernel?
> >
> > Thanks.
> >
> > Andrew
> >
>
> Excellent work! I'll be interested to hear how the testing goes!
>
>
> This patch still needs a bit of work. I'd consider this a proof-of- concept. You
> are at least demonstrating the problem with this patch (and a possible
> solution).
>
> Conceptually, it's not 100% clear to me that we want the exact same timeout
> on the RPC call and the xprt. We might, but working with interlocking
> timeouts can bring in emergent behavioral changes and I haven't thought
> through these.

At this point I'll admit that I don't fully understand the difference between those two, so I expect your thoughts on it will be more relevant than mine. :-) Happy to get more of your feedback on it. (I do notice that with the patch, timeouts during an initial mount take twice as long as expected, presumably one full timeout for the RPC call plus another for the xprt, so I assume this is related to the two separate timeouts you're talking about.)

Not knowing much, my initial guess is that the solution would come from one of these options:

- Create a system-wide tuneable for xs_[local|udp|tcp]_default_timeout (a rough sketch of what I mean follows this list). In our case that's less than ideal, since we want to change the total timeout for an NFS mount on a per-server or per-mount basis rather than a system-wide basis.

- Add a second set of timeout options to NFS so that RPC call and xprt timeouts can be specified separately. I'm guessing no-one is enthusiastic about option bloat, even if this would be the theoretically cleanest option.

- Use timeo and retrans for the RPC call timeout, and retry for the xprt timeout. Or do the opposite. The NFS manpage describes the current behaviour incorrectly, so this at least wouldn't make the documentation any worse.
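
For the first option, I imagine something like a module parameter (purely illustrative; I haven't tried this, and the parameter name is made up):

/* Hypothetical sunrpc module parameter; it would still need to be
 * wired into xs_tcp_default_timeout at transport setup time. */
static unsigned long xs_tcp_default_timeo_secs = 60;
module_param(xs_tcp_default_timeo_secs, ulong, 0644);
MODULE_PARM_DESC(xs_tcp_default_timeo_secs,
		 "Default timeout for TCP transports, in seconds");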


> > From caa3308a3bcf39eb95d9b59e63bd96361e98305e Mon Sep 17 00:00:00
> 2001
> > From: Andrew Klaassen <[email protected]>
> > Date: Fri, 10 Feb 2023 10:37:57 -0500
> > Subject: [PATCH] Sun RPC: Use passed-in timeouts if available instead
> > of always using defaults.
> >
>
> This needs a real patch description. Describe the problem you were
> having, and how this patch changes things to address it. Make sure you
> add a Signed-off-by line too.
>
> When you resend, send it to the NFS client maintainers (Trond and
> Anna) using git-format-patch and git-send-email, and cc the linux-nfs list.
> I think your MUA might have mangled the patch a bit. Please look over
> Documentation/process/submitting-patches.rst in the kernel source tree
> too.

Thanks for the tips. I did read submitting-patches.rst, but obviously not carefully enough. :-) Would it be appropriate to submit the patch as-is (with checkpatch.pl fixes), or should any potential interlocking-timeout issues be addressed first?


> > ---
> > include/linux/sunrpc/xprt.h | 3 +++
> > net/sunrpc/clnt.c | 1 +
> > net/sunrpc/xprt.c | 21 +++++++++++++++++++++
> > net/sunrpc/xprtsock.c | 22 +++++++++++++++++++---
> > 4 files changed, 44 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> > index b9f59aabee53..ca7be090cf83 100644
> > --- a/include/linux/sunrpc/xprt.h
> > +++ b/include/linux/sunrpc/xprt.h
> > @@ -333,6 +333,7 @@ struct xprt_create {
> > struct svc_xprt *bc_xprt; /* NFSv4.1 backchannel */
> > struct rpc_xprt_switch *bc_xps;
> > unsigned int flags;
> > + const struct rpc_timeout *timeout; /* timeout parms */
> > };
> >
> > struct xprt_class {
> > @@ -373,6 +374,8 @@ void xprt_release_xprt_cong(struct rpc_xprt *xprt, struct rpc_task *task);
> > void xprt_release(struct rpc_task *task);
> > struct rpc_xprt * xprt_get(struct rpc_xprt *xprt);
> > void xprt_put(struct rpc_xprt *xprt);
> > +struct rpc_timeout *xprt_alloc_timeout(const struct rpc_timeout *timeo,
> > + const struct rpc_timeout *default_timeo);
> > struct rpc_xprt * xprt_alloc(struct net *net, size_t size,
> > unsigned int num_prealloc,
> > unsigned int max_req);
> > diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> > index 0b0b9f1eed46..1350c1f489f7 100644
> > --- a/net/sunrpc/clnt.c
> > +++ b/net/sunrpc/clnt.c
> > @@ -532,6 +532,7 @@ struct rpc_clnt *rpc_create(struct rpc_create_args *args)
> > .addrlen = args->addrsize,
> > .servername = args->servername,
> > .bc_xprt = args->bc_xprt,
> > + .timeout = args->timeout,
> > };
> > char servername[48];
> > struct rpc_clnt *clnt;
> > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> > index ab453ede54f0..0bb800c90976 100644
> > --- a/net/sunrpc/xprt.c
> > +++ b/net/sunrpc/xprt.c
> > @@ -1801,6 +1801,26 @@ static void xprt_free_id(struct rpc_xprt *xprt)
> > ida_free(&rpc_xprt_ids, xprt->id);
> > }
> >
> > +struct rpc_timeout *xprt_alloc_timeout(const struct rpc_timeout *timeo,
> > + const struct rpc_timeout *default_timeo)
> > +{
> > + struct rpc_timeout *timeout;
> > +
> > + timeout = kzalloc(sizeof(*timeout), GFP_KERNEL);
> > + if (!timeout)
> > + return ERR_PTR(-ENOMEM);
> > + if (timeo)
> > + memcpy(timeout, timeo, sizeof(struct rpc_timeout));
> > + else
> > + memcpy(timeout, default_timeo, sizeof(struct rpc_timeout));
>
> I don't think you need an allocation here. struct rpc_timeout is quite
> small and it only contains a bunch of integers. I think it'd be better
> to just embed this in struct rpc_xprt instead.
>
> > + return timeout;
> > +}
> > +
> > +static void xprt_free_timeout(struct rpc_xprt *xprt)
> > +{
> > + kfree(xprt->timeout);
> > +}
> > +
> > struct rpc_xprt *xprt_alloc(struct net *net, size_t size,
> > unsigned int num_prealloc,
> > unsigned int max_alloc)
> > @@ -1837,6 +1857,7 @@ EXPORT_SYMBOL_GPL(xprt_alloc);
> >
> > void xprt_free(struct rpc_xprt *xprt)
> > {
> > + xprt_free_timeout(xprt);
> > put_net_track(xprt->xprt_net, &xprt->ns_tracker);
> > xprt_free_all_slots(xprt);
> > xprt_free_id(xprt);
> > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> > index aaa5b2741b79..13703f8e0ef1 100644
> > --- a/net/sunrpc/xprtsock.c
> > +++ b/net/sunrpc/xprtsock.c
> > @@ -2924,7 +2924,12 @@ static struct rpc_xprt *xs_setup_udp(struct xprt_create *args)
> >
> > xprt->ops = &xs_udp_ops;
> >
> > - xprt->timeout = &xs_udp_default_timeout;
> > + xprt->timeout = xprt_alloc_timeout(args->timeout, &xs_udp_default_timeout);
> > + if (IS_ERR(xprt->timeout))
> > + {
>
> Kernel coding style has the brackets on the same line as the "if"
> statement. You should run your next iteration through checkpatch.pl.

Thanks. I did that once, but forgot to do it again before sending this version.


> > + ret = ERR_CAST(xprt->timeout);
> > + goto out_err;
> > + }
> >
> > INIT_WORK(&transport->recv_worker, xs_udp_data_receive_workfn);
> > INIT_WORK(&transport->error_worker, xs_error_handle);
> > @@ -3003,7 +3008,13 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)
> > xprt->idle_timeout = XS_IDLE_DISC_TO;
> >
> > xprt->ops = &xs_tcp_ops;
> > - xprt->timeout = &xs_tcp_default_timeout;
> > +
> > + xprt->timeout = xprt_alloc_timeout(args->timeout, &xs_tcp_default_timeout);
> > + if (IS_ERR(xprt->timeout))
> > + {
> > + ret = ERR_CAST(xprt->timeout);
> > + goto out_err;
> > + }
> >
> > xprt->max_reconnect_timeout = xprt->timeout->to_maxval;
> > xprt->connect_timeout = xprt->timeout->to_initval *
> > @@ -3071,7 +3082,12 @@ static struct rpc_xprt *xs_setup_bc_tcp(struct xprt_create *args)
> > xprt->prot = IPPROTO_TCP;
> > xprt->xprt_class = &xs_bc_tcp_transport;
> > xprt->max_payload = RPC_MAX_FRAGMENT_SIZE;
> > - xprt->timeout = &xs_tcp_default_timeout;
> > + xprt->timeout = xprt_alloc_timeout(args->timeout, &xs_tcp_default_timeout);
> > + if (IS_ERR(xprt->timeout))
> > + {
> > + ret = ERR_CAST(xprt->timeout);
> > + goto out_err;
> > + }
> >
> > /* backchannel */
> > xprt_set_bound(xprt);
> > --
> > 2.39.1
> >
>
> --
> Jeff Layton <[email protected]>

Andrew Klaassen



2023-03-02 18:47:27

by Andrew Klaassen

[permalink] [raw]
Subject: RE: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

> From: Jeff Layton <[email protected]>
> Sent: Tuesday, February 28, 2023 8:24 AM
>
> On Mon, 2023-02-27 at 14:48 +0000, Andrew Klaassen wrote:
> > +struct rpc_timeout *xprt_alloc_timeout(const struct rpc_timeout *timeo,
> > + const struct rpc_timeout *default_timeo)
> > +{
> > + struct rpc_timeout *timeout;
> > +
> > + timeout = kzalloc(sizeof(*timeout), GFP_KERNEL);
> > + if (!timeout)
> > + return ERR_PTR(-ENOMEM);
> > + if (timeo)
> > + memcpy(timeout, timeo, sizeof(struct rpc_timeout));
> > + else
> > + memcpy(timeout, default_timeo, sizeof(struct rpc_timeout));
>
> I don't think you need an allocation here. struct rpc_timeout is quite
> small and it only contains a bunch of integers. I think it'd be better
> to just embed this in struct rpc_xprt instead.

I missed this in my initial reply; apologies. What do you mean by "embed" in this case?
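
My guess, which could be completely wrong, is that you mean adding a struct rpc_timeout member directly to struct rpc_xprt so the values get copied instead of pointed to, along these lines (the timeout_storage name is made up):

struct rpc_xprt {
	...
	const struct rpc_timeout *timeout;	/* existing pointer */
	struct rpc_timeout timeout_storage;	/* embedded copy */
	...
};

and then in xs_setup_tcp():

	if (args->timeout) {
		xprt->timeout_storage = *args->timeout;	/* struct copy */
		xprt->timeout = &xprt->timeout_storage;
	} else {
		xprt->timeout = &xs_tcp_default_timeout;
	}

Is that the sort of thing you mean?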

FWIW, every time I tried assigning xprt->timeout without an allocation, the timeout values were correct just after the assignment in xs_setup_tcp, but by the time the code reached xs_tcp_set_socket_timeouts the timeout was filled with random values. I'm sure this reflects my limitations as not-a-C-programmer, but no matter which way I tried it, I couldn't stop that from happening until I allocated memory.
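
If I had to guess at the cause, it would be that the struct behind args->timeout lives on the stack of some function in the mount path and is gone by the time the socket timeouts get set, as in this made-up example (not actual kernel code):

static const struct rpc_timeout *saved;

static void setup(const struct rpc_timeout *parms)
{
	saved = parms;	/* keeps pointing into the caller's stack frame */
}

static void do_mount(void)
{
	struct rpc_timeout timeparms = { .to_initval = 2 * HZ };

	setup(&timeparms);
}	/* timeparms is gone here, so *saved is now garbage */

But that's just a guess.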

Thanks.

Andrew Klaassen