2020-04-20 15:13:28

by sea you

Subject: TestStateID woes with recent clients

Dear all,

From time to time we're plagued with a lot of "TestStateID" RPC calls
on a 4.15.0-88 (Ubuntu Bionic) kernel, where the clients (~310 VMs) are
running either 4.19.106 or 4.19.107 (Flatcar Linux). What we see during
these "storms" is that _some_ clients are repeatedly testing the same
stateid for the callback, like:

[Thu Apr 9 15:18:57 2020] NFS reply test_stateid: succeeded, 0
[Thu Apr 9 15:18:57 2020] NFS call test_stateid 00000000ec5d02eb
[Thu Apr 9 15:18:57 2020] --> nfs41_call_sync_prepare
data->seq_server 000000006dfc86c9
[Thu Apr 9 15:18:57 2020] --> nfs4_alloc_slot used_slots=0000
highest_used=4294967295 max_slots=31
[Thu Apr 9 15:18:57 2020] <-- nfs4_alloc_slot used_slots=0001
highest_used=0 slotid=0
[Thu Apr 9 15:18:57 2020] encode_sequence:
sessionid=1585584999:2538115180:5741:0 seqid=13899229 slotid=0
max_slotid=0 cache_this=0
[Thu Apr 9 15:18:57 2020] nfs41_handle_sequence_flag_errors:
"10.1.4.65" (client ID 671b825e6c904897) flags=0x00000040
[Thu Apr 9 15:18:57 2020] --> nfs4_alloc_slot used_slots=0001
highest_used=0 max_slots=31
[Thu Apr 9 15:18:57 2020] <-- nfs4_alloc_slot used_slots=0003
highest_used=1 slotid=1
[Thu Apr 9 15:18:57 2020] nfs4_free_slot: slotid 1 highest_used_slotid 0
[Thu Apr 9 15:18:57 2020] nfs41_sequence_process: Error 0 free the slot
[Thu Apr 9 15:18:57 2020] nfs4_free_slot: slotid 0
highest_used_slotid 4294967295
[Thu Apr 9 15:18:57 2020] NFS reply test_stateid: succeeded, 0
[Thu Apr 9 15:18:57 2020] NFS call test_stateid 00000000ec5d02eb
[Thu Apr 9 15:18:57 2020] --> nfs41_call_sync_prepare
data->seq_server 000000006dfc86c9
[Thu Apr 9 15:18:57 2020] --> nfs4_alloc_slot used_slots=0000
highest_used=4294967295 max_slots=31
[Thu Apr 9 15:18:57 2020] <-- nfs4_alloc_slot used_slots=0001
highest_used=0 slotid=0
[Thu Apr 9 15:18:57 2020] encode_sequence:
sessionid=1585584999:2538115180:5741:0 seqid=13899230 slotid=0
max_slotid=0 cache_this=0
[Thu Apr 9 15:18:57 2020] nfs41_handle_sequence_flag_errors:
"10.1.4.65" (client ID 671b825e6c904897) flags=0x00000040
[Thu Apr 9 15:18:57 2020] --> nfs4_alloc_slot used_slots=0001
highest_used=0 max_slots=31
[Thu Apr 9 15:18:57 2020] <-- nfs4_alloc_slot used_slots=0003
highest_used=1 slotid=1
[Thu Apr 9 15:18:57 2020] nfs4_free_slot: slotid 1 highest_used_slotid 0
[Thu Apr 9 15:18:57 2020] nfs41_sequence_process: Error 0 free the slot
[Thu Apr 9 15:18:57 2020] nfs4_free_slot: slotid 0
highest_used_slotid 4294967295
[Thu Apr 9 15:18:57 2020] NFS reply test_stateid: succeeded, 0
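
For completeness: the messages above come from the NFS client's
dprintk-based debug tracing. A rough sketch of how such tracing can be
toggled, assuming the standard rpcdebug tool (the exact flag set may
differ):

  rpcdebug -m nfs -s all   # enable all NFS client debug messages (very chatty)
  rpcdebug -m nfs -c all   # switch them off again afterwards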

Due to this, some processes on some clients get stuck and these nodes
need to be rebooted. Initially we thought we were facing the issue that
was fixed in 44f411c353bf, but as far as I can see we're already running
a kernel to which it was backported via 90d73c1cadb8.
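
In case anyone wants to double-check that, a quick sketch assuming a
linux-stable checkout (this just lists the stable tags that already
contain the backport):

  git tag --contains 90d73c1cadb8 | grep '^v4\.19'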

Clients are mounting as
"rw,nosuid,nodev,noexec,noatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=600,acregmax=600,acdirmin=600,acdirmax=600,hard,proto=tcp,timeo=600,retrans=2,sec=sys"

Export options are the following
"<world>(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,fsid=762,sec=sys,rw,secure,no_root_squash,no_all_squash)"
(fsid varies from export to export of course)
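
As an aside, "rw" and "no_root_squash" appear twice in that string; a
deduplicated sketch of the same export as an /etc/exports line, with a
hypothetical path, would look roughly like:

  /srv/export  *(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,fsid=762,sec=sys,secure,no_all_squash)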

Our workload is very metadata-heavy (PHP), and the data being served
changes a lot as clients are uploading files etc.

We have a similar setup where the clients run 4.19.(6|7)8 (CoreOS) and
the server runs 4.15.0-76, and there we rarely see these TestStateID
RPC calls.
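
For what it's worth, the per-mount TEST_STATEID counters can be watched
on a client with something along the lines of:

  grep TEST_STATEID /proc/self/mountstats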

It's worth mentioning that the main difference between the setup that
is okay and the one that is not is the block size of the backing
filesystem (ZFS): the setup with a 512-byte block size is okay, the one
with 4k isn't, although I'm unsure how that would affect NFS at all.

The issue manifests at least once every day.

Can you please point me in a direction that I should check further?

Doma


2020-04-24 14:56:00

by sea you

Subject: Re: TestStateID woes with recent clients

Hey all,

I would really appreciate some help here. I took a capture on the
client while this was happening, and as you can see from it, it churns
through the same requests dozens of times over and over again.

https://filebin.net/zs9hfipxbz2mn7i8
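
A capture like that can be taken with something along these lines; the
interface and server address below are placeholders:

  tcpdump -i eth0 -s 0 -w nfs-client.pcap host <server> and port 2049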

Thanks in advance,
Doma


2020-04-27 22:20:56

by J. Bruce Fields

Subject: Re: TestStateID woes with recent clients

On Mon, Apr 20, 2020 at 04:32:27PM +0200, sea you wrote:
> From time to time we're plagued with a lot of "TestStateID" RPC calls
> on a 4.15.0-88 (Ubuntu Bionic) kernel, where the clients (~310 VMs)
> are running either 4.19.106 or 4.19.107 (Flatcar Linux). What we see
> during these "storms" is that _some_ clients are repeatedly testing
> the same stateid for the callback, like:
...
> Due to this, some processes on some clients get stuck and these nodes
> need to be rebooted. Initially we thought we were facing the issue
> that was fixed in 44f411c353bf, but as far as I can see we're already
> running a kernel to which it was backported via 90d73c1cadb8.
>
> Clients are mounting as
> "rw,nosuid,nodev,noexec,noatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=600,acregmax=600,acdirmin=600,acdirmax=600,hard,proto=tcp,timeo=600,retrans=2,sec=sys"
>
> Export options are the following
> "<world>(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,fsid=762,sec=sys,rw,secure,no_root_squash,no_all_squash)"

Sorry for the derail, but the "async" export option is almost never a
good idea.
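
For example, the same export with "sync" instead (sync is also the
default when neither option is given) would be something like:

  <world>(rw,sync,wdelay,crossmnt,no_root_squash,no_subtree_check,fsid=762,sec=sys,secure,no_all_squash)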

> Can you please point me in a direction that I should check further?

Off the top of my head, my only suggestion is to retest if possible on
the latest upstream kernel.

--b.