LinuxLists.cc - NFS4 in combination with root over NFS3, hangs and deadlocks

2010-03-29 15:14:43

by Anton Starikov

Subject: NFS4 in combination with root over NFS3, hangs and deadlocks

Attachments:

log1.txt.gz (44.40 kB)

2010-03-31 00:35:29

by Anton Starikov

Subject: Re: NFS4 in combination with root over NFS3, hangs and deadlocks

OK, I may be wrong in my guess.

But I found an earlier report in the mailing list archive.

The subject was "NFS regression? Odd delays and lockups accessing an NFS export.", with the last message from 2008-09-27 10:16:26.

There was a lot of traffic with attempts to investigate the problem, but I couldn't find out whether it was resolved or not.


2010-03-30 19:11:24

by Anton Starikov

Subject: Re: NFS4 in combination with root over NFS3, hangs and deadlocks

On Mar 30, 2010, at 9:00 PM, Chuck Lever wrote:

> On 03/30/2010 02:30 PM, Anton Starikov wrote:
>> If it is already resolved problem, can someone point me into direction of particular patch?
>
> As far as I know NFSv4 is known not to work with an NFSv3 root, in any kernel.

But NFS4-root (does it finally work?) isn't always a desirable solution, especially if different OSes are used for client and server.

And it seems that it generally works; some deadlock just occurs, probably related to caching of credentials.

Anton.


2010-03-30 19:01:39

by Chuck Lever III

Subject: Re: NFS4 in combination with root over NFS3, hangs and deadlocks

On 03/30/2010 02:30 PM, Anton Starikov wrote:
> If it is already resolved problem, can someone point me into direction of particular patch?

As far as I know NFSv4 is known not to work with an NFSv3 root, in any
kernel.


--
chuck[dot]lever[at]oracle[dot]com

2010-03-31 00:09:23

by Anton Starikov

Subject: Re: NFS4 in combination with root over NFS3, hangs and deadlocks

I'm not an expert in kernel debugging, but I think the hang happens in rpcauth_lookup_credcache.
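
One way to check a guess like this against the attached task dump is to scan the dump for the function name. The following is only a minimal sketch, assuming the dump is a gzip-compressed text file such as the log1.txt.gz attachment; the helper name find_symbol is hypothetical and not part of any NFS tooling.

```python
import gzip


def find_symbol(path, symbol):
    """Return (line_number, line) pairs for every line of a gzip-compressed
    text log that mentions `symbol`. Line numbers start at 1."""
    hits = []
    # errors="replace" keeps the scan going if the log contains stray bytes.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if symbol in line:
                hits.append((number, line.rstrip("\n")))
    return hits
```

Calling find_symbol("log1.txt.gz", "rpcauth_lookup_credcache") would list the matching lines; the surrounding lines of the dump then show which task each stack trace belongs to.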


2010-03-30 19:29:09

by Chuck Lever III

Subject: Re: NFS4 in combination with root over NFS3, hangs and deadlocks

On 03/30/2010 03:11 PM, Anton Starikov wrote:
> On Mar 30, 2010, at 9:00 PM, Chuck Lever wrote:
>
>> On 03/30/2010 02:30 PM, Anton Starikov wrote:
>>> If it is already resolved problem, can someone point me into direction of particular patch?
>>
>> As far as I know NFSv4 is known not to work with an NFSv3 root, in any kernel.
>
>
> But NFS4-root (does it work finally?) isn't always desirable solution. Especially if different OSes used for client/server.
>
> And it seems that generally it works, just some deadlock occurs, probably related to caching of some credentials.

No, NFSv4 root is known to have problems, and is unsupported, as far as
I know.


--
chuck[dot]lever[at]oracle[dot]com

2010-03-30 20:59:55

by Anton Starikov

Subject: Re: NFS4 in combination with root over NFS3, hangs and deadlocks

Then it isn't normal.
A diskless setup is limited to old NFS3 for non-root partitions, which isn't nice:
no proper ACLs, no delegations.


2010-03-30 18:30:51

by Anton Starikov

Subject: Re: NFS4 in combination with root over NFS3, hangs and deadlocks

If this problem is already resolved, can someone point me in the direction of the particular patch?

Anton.

On Mar 29, 2010, at 5:14 PM, Anton Starikov wrote:

> Hi,
>
> Earlier (a year ago and again recently) I reported my failures in getting NFS4 mounts working (primarily automounting /home) on a system booted with an NFSv3 root. It always used to silently hang the nodes with zero output in the logs. It was definitely a client issue (I tried it with different versions of Linux and Solaris servers).
>
> I can't get a simple, reproducible test case, because the hangs appear randomly: it can happen in 1 hour, it can happen in 5 days, but it always happens after some time. But this time I got some improvement.
>
> With the 2.6.32.9-70.fc12.x86_64 kernel and fresh nfs-utils from Fedora 12, after the NFS4 mounts hang, the NFS3 mounts and the node itself continue to work, which gives a chance to investigate the problem.
>
> Can you give me instructions on how to collect all the information needed to figure out where the bug is?
>
> As a starting point I will attach the output of echo "t" > sysrq-trigger and the list of NFS mounts.
>
> Thanks,
> Anton.
>
> # cat /proc/mounts | grep nfs
> 172.19.8.1:/export/share/cluster/fedora-root / nfs ro,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,nolock,proto=udp,port=65535,timeo=7,retrans=3,sec=sys,mountport=65535,addr=172.19.8.1 0 0
> none /var/lib/nfs tmpfs rw,relatime 0 0
> sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
> 172.19.8.1:/export/share/cluster/admin /root nfs rw,noatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.19.8.1,mountvers=3,mountport=44114,mountproto=tcp,addr=172.19.8.1 0 0
> 172.19.8.1:/export/share/cluster/checkpoint /mnt/checkpoint nfs rw,noatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.19.8.1,mountvers=3,mountport=52574,mountproto=udp,addr=172.19.8.1 0 0
> 172.19.8.1:/export/share/software /software nfs rw,noatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.19.8.1,mountvers=3,mountport=44114,mountproto=tcp,addr=172.19.8.1 0 0
> 172.19.8.1:/export/share/cluster/torque /var/torque nfs rw,noatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.19.8.1,mountvers=3,mountport=44114,mountproto=tcp,addr=172.19.8.1 0 0
> 172.19.8.1:/export/share/common/ /common nfs4 rw,noatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.19.8.133,addr=172.19.8.1 0 0
> 172.19.8.1:/export/home/alfons/ /home/alfons nfs4 rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.19.8.133,addr=172.19.8.1 0 0
>
> <log1.txt.gz>
>
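
For a mount list like the one quoted above, it can help to see at a glance which mount points use NFSv3 and which use NFSv4. A minimal sketch, assuming the input is the raw text of /proc/mounts (without the "> " quote prefixes); the function name nfs_mounts_by_version is hypothetical.

```python
def nfs_mounts_by_version(mounts_text):
    """Group NFS mount points by protocol version, given /proc/mounts text."""
    by_version = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        version = "unknown"
        for option in fields[3].split(","):
            # Matches "vers=3" but not "mountvers=3", which starts with "mount".
            if option.startswith("vers="):
                version = option.split("=", 1)[1]
                break
        by_version.setdefault(version, []).append(fields[1])
    return by_version
```

On a live node the input could come from open("/proc/mounts").read(). For the list above, this would put /, /root, /mnt/checkpoint, /software and /var/torque under version 3, and /common and /home/alfons under version 4.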

2010-04-02 17:23:51

by Chuck Lever III

Subject: Re: NFS4 random hangs

On 04/02/2010 08:48 AM, Anton Starikov wrote:
> Any chance that someone will look at this case and help pinpoint the problem?
>
> It's not that I expect anything, but in case someone is interested and wants to step in, I'm keeping the mounts on NFS4 and still experiencing hangs. But if now is not the best time, I will switch to NFS3.

I suspect no-one will get to this soon.

> Anton.
>
> On Mar 31, 2010, at 2:35 AM, Anton Starikov wrote:
>
>> Ok, I can be wrong in my guess,
>>
>> But I found another report earlier in mailing list archive.
>>
>> Subject was "NFS regression? Odd delays and lockups accessing an NFS export." with last message from 2008-09-27 10:16:26
>>
>> There were a lot of traffic with attempts to investigate problem. But I didn't find information was it resolved or not.
>>
>>
>>
>>
>>
>> On Mar 31, 2010, at 2:09 AM, Anton Starikov wrote:
>>
>>> I'm not an expert in kernel debugging, but I think hang happens in rpcauth_lookup_credcache
>>>
>>>
>>> On Mar 30, 2010, at 10:59 PM, Anton Starikov wrote:
>>>
>>>> Then it isn't normal.
>>>> Diskless setup is limited by old NFS3 for non-root partition, which isn't nice.
>>>> no proper ACL, no delegations.
>>>>
>>>>
>>>> On Mar 30, 2010, at 9:27 PM, Chuck Lever wrote:
>>>>
>>>>> On 03/30/2010 03:11 PM, Anton Starikov wrote:
>>>>>> On Mar 30, 2010, at 9:00 PM, Chuck Lever wrote:
>>>>>>
>>>>>>> On 03/30/2010 02:30 PM, Anton Starikov wrote:
>>>>>>>> If it is already resolved problem, can someone point me into direction of particular patch?
>>>>>>>
>>>>>>> As far as I know NFSv4 is known not to work with an NFSv3 root, in any kernel.
>>>>>>
>>>>>>
>>>>>> But NFS4-root (does it work finally?) isn't always desirable solution. Especially if different OSes used for client/server.
>>>>>>
>>>>>> And it seems that generally it works, just some deadlock occurs, probably related to caching of some credentials.
>>>>>
>>>>> No, NFSv4 root is known to have problems, and is unsupported, as far as I know.
>>>>>
>>>>>> Anton,
>>>>>>
>>>>>>>> Anton.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mar 29, 2010, at 5:14 PM, Anton Starikov wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Early (year ago and recently) I reported about my faults in getting working NFS4 mounts (primary automounting /home) with system booted with NFSv3-root. It always used to silently hang nodes with zero output in the logs. It was definitely client issue (I tried it with different versions of linux and solaris servers)
>>>>>>>>>
>>>>>>>>> Although I can't get simple and reproducible test-case, because hangs appears randomly, it can happen in 1hour, it can happen in 5 days, but it always will happen after some time. But this time I got some some improvement.
>>>>>>>>>
>>>>>>>>> With 2.6.32.9-70.fc12.x86_64 kernel and fresh nfs-utils from Fedora-12, after NFS4 mounts hangs, NFS3 mounts and node itself still continue to work, which gives chance to investigate problem.
>>>>>>>>>
>>>>>>>>> Can you give me instruction how to collect all necessary information to figure out where the bug is?
>>>>>>>>>
>>>>>>>>> As starting point I will attach output of echo "t"> sysrq-trigge, list of NFS mounts.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Anton.
>>>>>>>>>
>>>>>>>>> # cat /proc/mounts | grep nfs
>>>>>>>>> 172.19.8.1:/export/share/cluster/fedora-root / nfs ro,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,nolock,proto=udp,port=65535,timeo=7,retrans=3,sec=sys,mountport=65535,addr=172.19.8.1 0 0
>>>>>>>>> none /var/lib/nfs tmpfs rw,relatime 0 0
>>>>>>>>> sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
>>>>>>>>> 172.19.8.1:/export/share/cluster/admin /root nfs rw,noatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.19.8.1,mountvers=3,mountport=44114,mountproto=tcp,addr=172.19.8.1 0 0
>>>>>>>>> 172.19.8.1:/export/share/cluster/checkpoint /mnt/checkpoint nfs rw,noatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.19.8.1,mountvers=3,mountport=52574,mountproto=udp,addr=172.19.8.1 0 0
>>>>>>>>> 172.19.8.1:/export/share/software /software nfs rw,noatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.19.8.1,mountvers=3,mountport=44114,mountproto=tcp,addr=172.19.8.1 0 0
>>>>>>>>> 172.19.8.1:/export/share/cluster/torque /var/torque nfs rw,noatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.19.8.1,mountvers=3,mountport=44114,mountproto=tcp,addr=172.19.8.1 0 0
>>>>>>>>> 172.19.8.1:/export/share/common/ /common nfs4 rw,noatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.19.8.133,addr=172.19.8.1 0 0
>>>>>>>>> 172.19.8.1:/export/home/alfons/ /home/alfons nfs4 rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.19.8.133,addr=172.19.8.1 0 0
>>>>>>>>>
>>>>>>>>> <log1.txt.gz>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>> the body of a message to [email protected]
>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> chuck[dot]lever[at]oracle[dot]com
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> chuck[dot]lever[at]oracle[dot]com
>>>>
>>>
>>
>

--
chuck[dot]lever[at]oracle[dot]com

2010-04-02 12:54:30

by Anton Starikov

Subject: Re: NFS4 random hangs

Is there any chance that someone will look at this case and help pinpoint the problem?

It's not that I expect anything, but in case someone is interested and wants to step in, I'm keeping the NFS4 mounts in place and still experiencing hangs. If now is not the best time, I will switch to NFS3.

Anton.

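
For anyone comparing mount lists like the one posted in this thread, here is a minimal Python sketch that picks the NFS entries out of /proc/mounts-style text and groups the mount points by their vers= option, so it is easy to see which mounts are NFSv3 and which are NFSv4. The function names (parse_nfs_mounts, group_by_version) are made up for illustration, the sample lines are copied from the listing in the thread, and nothing here diagnoses the hang itself.

```python
# Sketch only: helper names are hypothetical, not part of nfs-utils or the kernel.

SAMPLE = """\
172.19.8.1:/export/share/cluster/fedora-root / nfs ro,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,nolock,proto=udp,port=65535,timeo=7,retrans=3,sec=sys,mountport=65535,addr=172.19.8.1 0 0
none /var/lib/nfs tmpfs rw,relatime 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
172.19.8.1:/export/share/common/ /common nfs4 rw,noatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.19.8.133,addr=172.19.8.1 0 0
"""


def parse_nfs_mounts(text):
    """Return one dict per nfs/nfs4 entry found in /proc/mounts-format text."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        device, mountpoint, fstype, raw_opts = fields[:4]
        opts = {}
        for item in raw_opts.split(","):
            # Flag options such as "hard" have no "=", so their value is "".
            key, _, value = item.partition("=")
            opts[key] = value
        mounts.append({
            "device": device,
            "mountpoint": mountpoint,
            "fstype": fstype,
            "vers": opts.get("vers"),
            "proto": opts.get("proto"),
        })
    return mounts


def group_by_version(mounts):
    """Map each vers= value to the list of mount points using it."""
    groups = {}
    for m in mounts:
        groups.setdefault(m["vers"], []).append(m["mountpoint"])
    return groups


if __name__ == "__main__":
    for vers, points in sorted(group_by_version(parse_nfs_mounts(SAMPLE)).items()):
        print("vers=%s: %s" % (vers, ", ".join(points)))
```

On a live node the same functions could be fed the contents of /proc/mounts instead of SAMPLE.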