2022-04-28 15:17:20

by Dennis Dalessandro

Subject: NFS regression between 5.17 and 5.18

Hi NFS folks,

I've noticed a pretty nasty regression in our NFS capability between 5.17 and
5.18-rc1. I've tried to bisect but am not having any luck. The problem I'm
seeing is that it takes 3 minutes to copy a file from NFS to the local disk,
when it should take less than half a second, which it did up through 5.17.

It doesn't seem to be network related, but can't rule that out completely.

I tried to bisect but the problem can be intermittent. Some runs I'll see a
problem in 3 out of 100 cycles, sometimes 0 out of 100. Sometimes I'll see it
100 out of 100.

The latest attempt I have is at 5.18-rc4, and here's what I see when I 1) create
a file with dd, 2) copy it from local disk to the NFS mount, 3) copy it from NFS
back to local disk.

Test 2
Creating /dev/shm/run_nfs_test.junk...
262144+0 records in
262144+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 1.29037 s, 208 MB/s
Done
copy /dev/shm/run_nfs_test.junk to /mnt/nfs_test/run_nfs_test.junk...

real 0m0.502s
user 0m0.001s
sys 0m0.250s
Done
copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...

real 4m11.835s
user 0m0.001s
sys 0m0.277s
Done
Comparing files...
Done
Remove /dev/shm/run_nfs_test.tmp...

real 0m0.031s
user 0m0.000s
sys 0m0.031s
Done
Remove /dev/shm/run_nfs_test.junk...

real 0m0.031s
user 0m0.001s
sys 0m0.030s
Done
Remove /mnt/nfs_test/run_nfs_test.junk...

real 0m0.024s
user 0m0.001s
sys 0m0.022s
Done

Set up the server with a 4G RAM disk:
mount -t tmpfs -o size=4096M tmpfs /mnt/nfs_test

exportfs -o fsid=0,rw,async,insecure,no_root_squash
192.168.254.0/255.255.255.0:/mnt/nfs_test

Mount on the client side with:
mount -o rdma,port=20049 192.168.254.1:/mnt/nfs_test /mnt/nfs_test

Any advice on tracking this down further would be greatly appreciated.

Thanks

-Denny
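
For reference, the test cycle described above boils down to something like the
following sketch. The paths and the 256 MiB size come from the log output; the
dd input device, the block size (1024 bytes, derived from 262144 records and
268435456 bytes), and the exact cp/cmp invocations are assumptions, not the
actual run_nfs_test harness:

# Rough sketch of one test cycle; details beyond the logged paths are assumed.
dd if=/dev/zero of=/dev/shm/run_nfs_test.junk bs=1024 count=262144
time cp /dev/shm/run_nfs_test.junk /mnt/nfs_test/run_nfs_test.junk
time cp /mnt/nfs_test/run_nfs_test.junk /dev/shm/run_nfs_test.tmp
cmp /dev/shm/run_nfs_test.junk /dev/shm/run_nfs_test.tmp
rm -f /dev/shm/run_nfs_test.tmp /dev/shm/run_nfs_test.junk
rm -f /mnt/nfs_test/run_nfs_test.junk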


2022-04-28 16:43:26

by Chuck Lever

Subject: Re: NFS regression between 5.17 and 5.18



> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro <[email protected]> wrote:
>
> Hi NFS folks,
>
> I've noticed a pretty nasty regression in our NFS capability between 5.17 and
> 5.18-rc1. I've tried to bisect but not having any luck. The problem I'm seeing
> is it takes 3 minutes to copy a file from NFS to the local disk. When it should
> take less than half a second, which it did up through 5.17.
>
> It doesn't seem to be network related, but can't rule that out completely.
>
> I tried to bisect but the problem can be intermittent. Some runs I'll see a
> problem in 3 out of 100 cycles, sometimes 0 out of 100. Sometimes I'll see it
> 100 out of 100.

It's not clear from your problem report whether the problem appears
when it's the server running v5.18-rc or the client.

It looks vaguely like a recent client issue where it encounters a
memory shortage while preparing an RPC and a delay is needed.


> The latest attempt I have is at 5.18-rc4 and here's what I see when I 1) create
> file with dd, 2) copy local disk to NFS mount, 3) copy NFS to local disk.
>
> Test 2
> Creating /dev/shm/run_nfs_test.junk...
> 262144+0 records in
> 262144+0 records out
> 268435456 bytes (268 MB, 256 MiB) copied, 1.29037 s, 208 MB/s
> Done
> copy /dev/shm/run_nfs_test.junk to /mnt/nfs_test/run_nfs_test.junk...
>
> real 0m0.502s
> user 0m0.001s
> sys 0m0.250s
> Done
> copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
>
> real 4m11.835s
> user 0m0.001s
> sys 0m0.277s
> Done
> Comparing files...
> Done
> Remove /dev/shm/run_nfs_test.tmp...
>
> real 0m0.031s
> user 0m0.000s
> sys 0m0.031s
> Done
> Remove /dev/shm/run_nfs_test.junk...
>
> real 0m0.031s
> user 0m0.001s
> sys 0m0.030s
> Done
> Remove /mnt/nfs_test/run_nfs_test.junk...
>
> real 0m0.024s
> user 0m0.001s
> sys 0m0.022s
> Done
>
> Create the server with a 4G RAM disk:
> mount -t tmpfs -o size=4096M tmpfs /mnt/nfs_test
>
> exportfs -o fsid=0,rw,async,insecure,no_root_squash
> 192.168.254.0/255.255.255.0:/mnt/nfs_test
>
> Mount on the client side with:
> mount -o rdma,port=20049 192.168.254.1:/mnt/nfs_test /mnt/nfs_test
>
> Any advice on tracking this down further would be greatly appreciated.
>
> Thanks
>
> -Denny
>

--
Chuck Lever



2022-04-29 09:33:48

by Trond Myklebust

Subject: Re: NFS regression between 5.17 and 5.18

On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
> > On 4/28/22 10:57 AM, Chuck Lever III wrote:
> > > > On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
> > > > <[email protected]> wrote:
> > > >
> > > > Hi NFS folks,
> > > >
> > > > I've noticed a pretty nasty regression in our NFS capability
> > > > between 5.17 and
> > > > 5.18-rc1. I've tried to bisect but not having any luck. The
> > > > problem I'm seeing
> > > > is it takes 3 minutes to copy a file from NFS to the local
> > > > disk. When it should
> > > > take less than half a second, which it did up through 5.17.
> > > >
> > > > It doesn't seem to be network related, but can't rule that out
> > > > completely.
> > > >
> > > > I tried to bisect but the problem can be intermittent. Some
> > > > runs I'll see a
> > > > problem in 3 out of 100 cycles, sometimes 0 out of 100.
> > > > Sometimes I'll see it
> > > > 100 out of 100.
> > >
> > > It's not clear from your problem report whether the problem
> > > appears
> > > when it's the server running v5.18-rc or the client.
> >
> > That's because I don't know which it is. I'll do a quick test and
> > find out. I
> > was testing the same kernel across both nodes.
>
> Looks like it is the client.
>
> server  client  result
> ------  ------  ------
> 5.17    5.17    Pass
> 5.17    5.18    Fail
> 5.18    5.18    Fail
> 5.18    5.17    Pass
>
> Is there a patch for the client issue you mentioned that I could try?
>
> -Denny

Try this one


--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]



Attachments:
0001-SUNRPC-Fix-an-RPC-RDMA-performance-regression.patch (2.39 kB)
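
Applying the attached patch on top of the -rc tree follows the usual git
workflow; a generic sketch, not instructions given in the thread:

cd linux
git checkout v5.18-rc4
git am 0001-SUNRPC-Fix-an-RPC-RDMA-performance-regression.patch
# then rebuild and boot the client kernel as usual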

2022-04-29 09:47:01

by Dennis Dalessandro

Subject: Re: NFS regression between 5.17 and 5.18

On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro <[email protected]> wrote:
>>>
>>> Hi NFS folks,
>>>
>>> I've noticed a pretty nasty regression in our NFS capability between 5.17 and
>>> 5.18-rc1. I've tried to bisect but not having any luck. The problem I'm seeing
>>> is it takes 3 minutes to copy a file from NFS to the local disk. When it should
>>> take less than half a second, which it did up through 5.17.
>>>
>>> It doesn't seem to be network related, but can't rule that out completely.
>>>
>>> I tried to bisect but the problem can be intermittent. Some runs I'll see a
>>> problem in 3 out of 100 cycles, sometimes 0 out of 100. Sometimes I'll see it
>>> 100 out of 100.
>>
>> It's not clear from your problem report whether the problem appears
>> when it's the server running v5.18-rc or the client.
>
> That's because I don't know which it is. I'll do a quick test and find out. I
> was testing the same kernel across both nodes.

Looks like it is the client.

server  client  result
------  ------  ------
5.17    5.17    Pass
5.17    5.18    Fail
5.18    5.18    Fail
5.18    5.17    Pass

Is there a patch for the client issue you mentioned that I could try?

-Denny

2022-04-29 10:56:23

by Dennis Dalessandro

Subject: Re: NFS regression between 5.17 and 5.18

On 4/28/22 10:57 AM, Chuck Lever III wrote:
>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro <[email protected]> wrote:
>>
>> Hi NFS folks,
>>
>> I've noticed a pretty nasty regression in our NFS capability between 5.17 and
>> 5.18-rc1. I've tried to bisect but not having any luck. The problem I'm seeing
>> is it takes 3 minutes to copy a file from NFS to the local disk. When it should
>> take less than half a second, which it did up through 5.17.
>>
>> It doesn't seem to be network related, but can't rule that out completely.
>>
>> I tried to bisect but the problem can be intermittent. Some runs I'll see a
>> problem in 3 out of 100 cycles, sometimes 0 out of 100. Sometimes I'll see it
>> 100 out of 100.
>
> It's not clear from your problem report whether the problem appears
> when it's the server running v5.18-rc or the client.

That's because I don't know which it is. I'll do a quick test and find out. I
was testing the same kernel across both nodes.

-Denny

2022-04-29 13:55:34

by Dennis Dalessandro

Subject: Re: NFS regression between 5.17 and 5.18

On 4/28/22 3:56 PM, Trond Myklebust wrote:
> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
>>> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
>>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi NFS folks,
>>>>>
>>>>> I've noticed a pretty nasty regression in our NFS capability
>>>>> between 5.17 and
>>>>> 5.18-rc1. I've tried to bisect but not having any luck. The
>>>>> problem I'm seeing
>>>>> is it takes 3 minutes to copy a file from NFS to the local
>>>>> disk. When it should
>>>>> take less than half a second, which it did up through 5.17.
>>>>>
>>>>> It doesn't seem to be network related, but can't rule that out
>>>>> completely.
>>>>>
>>>>> I tried to bisect but the problem can be intermittent. Some
>>>>> runs I'll see a
>>>>> problem in 3 out of 100 cycles, sometimes 0 out of 100.
>>>>> Sometimes I'll see it
>>>>> 100 out of 100.
>>>>
>>>> It's not clear from your problem report whether the problem
>>>> appears
>>>> when it's the server running v5.18-rc or the client.
>>>
>>> That's because I don't know which it is. I'll do a quick test and
>>> find out. I
>>> was testing the same kernel across both nodes.
>>
>> Looks like it is the client.
>>
>> server  client  result
>> ------  ------  ------
>> 5.17    5.17    Pass
>> 5.17    5.18    Fail
>> 5.18    5.18    Fail
>> 5.18    5.17    Pass
>>
>> Is there a patch for the client issue you mentioned that I could try?
>>
>> -Denny
>
> Try this one

Thanks for the patch. Unfortunately it doesn't seem to solve the issue, still
see intermittent hangs. I applied it on top of -rc4:

copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...

real 2m6.072s
user 0m0.002s
sys 0m0.263s
Done

While it was hung I checked the mem usage on the machine:

# free -h
              total        used        free      shared  buff/cache   available
Mem:           62Gi       871Mi        61Gi       342Mi       889Mi        61Gi
Swap:         4.0Gi          0B       4.0Gi

Doesn't appear to be under memory pressure.

-Denny
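
Beyond free(1), a few other quick checks during a hang can help separate memory
pressure from a stuck RPC. These are generic suggestions rather than anything
requested in the thread; /proc/pressure requires CONFIG_PSI, and the grep
pattern is only an example:

# Dirty/writeback page state and (if CONFIG_PSI) memory pressure stalls
grep -E 'Dirty|Writeback' /proc/meminfo
cat /proc/pressure/memory
# Per-mount NFS/RPC counters; retransmits or a stalled op show up here
grep -A 30 nfs_test /proc/self/mountstats
nfsstat -c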

2022-04-29 18:13:58

by Chuck Lever

Subject: Re: NFS regression between 5.17 and 5.18



> On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro <[email protected]> wrote:
>
> On 4/28/22 3:56 PM, Trond Myklebust wrote:
>> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
>>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
>>>> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Hi NFS folks,
>>>>>>
>>>>>> I've noticed a pretty nasty regression in our NFS capability
>>>>>> between 5.17 and
>>>>>> 5.18-rc1. I've tried to bisect but not having any luck. The
>>>>>> problem I'm seeing
>>>>>> is it takes 3 minutes to copy a file from NFS to the local
>>>>>> disk. When it should
>>>>>> take less than half a second, which it did up through 5.17.
>>>>>>
>>>>>> It doesn't seem to be network related, but can't rule that out
>>>>>> completely.
>>>>>>
>>>>>> I tried to bisect but the problem can be intermittent. Some
>>>>>> runs I'll see a
>>>>>> problem in 3 out of 100 cycles, sometimes 0 out of 100.
>>>>>> Sometimes I'll see it
>>>>>> 100 out of 100.
>>>>>
>>>>> It's not clear from your problem report whether the problem
>>>>> appears
>>>>> when it's the server running v5.18-rc or the client.
>>>>
>>>> That's because I don't know which it is. I'll do a quick test and
>>>> find out. I
>>>> was testing the same kernel across both nodes.
>>>
>>> Looks like it is the client.
>>>
>>> server  client  result
>>> ------  ------  ------
>>> 5.17    5.17    Pass
>>> 5.17    5.18    Fail
>>> 5.18    5.18    Fail
>>> 5.18    5.17    Pass
>>>
>>> Is there a patch for the client issue you mentioned that I could try?
>>>
>>> -Denny
>>
>> Try this one
>
> Thanks for the patch. Unfortunately it doesn't seem to solve the issue, still
> see intermittent hangs. I applied it on top of -rc4:
>
> copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
>
> real 2m6.072s
> user 0m0.002s
> sys 0m0.263s
> Done
>
> While it was hung I checked the mem usage on the machine:
>
> # free -h
>              total        used        free      shared  buff/cache   available
> Mem:           62Gi       871Mi        61Gi       342Mi       889Mi        61Gi
> Swap:         4.0Gi          0B       4.0Gi
>
> Doesn't appear to be under memory pressure.

Hi, since you know now that it is the client, perhaps a bisect
would be more successful?


--
Chuck Lever
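
Given the intermittency described earlier (sometimes 0 failures in 100 cycles),
each bisect step needs a large number of copy cycles before a kernel can be
declared good. Below is a rough sketch of a per-step test predicate; the
100-cycle count and 10-second threshold are arbitrary assumptions, the test
file is assumed to already exist on the mount, and each candidate kernel still
has to be built and booted by hand before running it:

#!/bin/sh
# Run on the client after booting the bisect candidate; exit non-zero if
# any copy stalls, then feed the result to "git bisect good"/"git bisect bad".
for i in $(seq 1 100); do
    start=$(date +%s)
    cp /mnt/nfs_test/run_nfs_test.junk /dev/shm/run_nfs_test.tmp
    rm -f /dev/shm/run_nfs_test.tmp
    if [ $(( $(date +%s) - start )) -gt 10 ]; then
        echo "cycle $i took too long"
        exit 1
    fi
done
exit 0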



Subject: Re: NFS regression between 5.17 and 5.18

[TLDR: I'm adding the regression report below to regzbot, the Linux
kernel regression tracking bot; all text you find below is compiled from
a few template paragraphs you might have encountered already
from similar mails.]

Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

To be sure below issue doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced v5.17..v5.18
#regzbot title nfs: copy from NFS to the local disk suddenly takes 3
min instead of half a second
#regzbot ignore-activity

If it turns out this isn't a regression, feel free to remove it from the
tracking by sending a reply to this thread containing a paragraph like
"#regzbot invalid: reason why this is invalid" (without the quotes).

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply; it's in everyone's interest to set the public record straight.

On 28.04.22 15:05, Dennis Dalessandro wrote:
> Hi NFS folks,
>
> I've noticed a pretty nasty regression in our NFS capability between 5.17 and
> 5.18-rc1. I've tried to bisect but not having any luck. The problem I'm seeing
> is it takes 3 minutes to copy a file from NFS to the local disk. When it should
> take less than half a second, which it did up through 5.17.
>
> It doesn't seem to be network related, but can't rule that out completely.
>
> I tried to bisect but the problem can be intermittent. Some runs I'll see a
> problem in 3 out of 100 cycles, sometimes 0 out of 100. Sometimes I'll see it
> 100 out of 100.
>
> The latest attempt I have is at 5.18-rc4 and here's what I see when I 1) create
> file with dd, 2) copy local disk to NFS mount, 3) copy NFS to local disk.
>
> Test 2
> Creating /dev/shm/run_nfs_test.junk...
> 262144+0 records in
> 262144+0 records out
> 268435456 bytes (268 MB, 256 MiB) copied, 1.29037 s, 208 MB/s
> Done
> copy /dev/shm/run_nfs_test.junk to /mnt/nfs_test/run_nfs_test.junk...
>
> real 0m0.502s
> user 0m0.001s
> sys 0m0.250s
> Done
> copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
>
> real 4m11.835s
> user 0m0.001s
> sys 0m0.277s
> Done
> Comparing files...
> Done
> Remove /dev/shm/run_nfs_test.tmp...
>
> real 0m0.031s
> user 0m0.000s
> sys 0m0.031s
> Done
> Remove /dev/shm/run_nfs_test.junk...
>
> real 0m0.031s
> user 0m0.001s
> sys 0m0.030s
> Done
> Remove /mnt/nfs_test/run_nfs_test.junk...
>
> real 0m0.024s
> user 0m0.001s
> sys 0m0.022s
> Done
>
> Create the server with a 4G RAM disk:
> mount -t tmpfs -o size=4096M tmpfs /mnt/nfs_test
>
> exportfs -o fsid=0,rw,async,insecure,no_root_squash
> 192.168.254.0/255.255.255.0:/mnt/nfs_test
>
> Mount on the client side with:
> mount -o rdma,port=20049 192.168.254.1:/mnt/nfs_test /mnt/nfs_test
>
> Any advice on tracking this down further would be greatly appreciated.
>
> Thanks
>
> -Denny
>

--
Additional information about regzbot:

If you want to know more about regzbot, check out its web interface, the
getting started guide, and the reference documentation:

https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md

The last two documents will explain how you can interact with regzbot
yourself if you want to.

Hint for reporters: when reporting a regression, CC the regression list and
tell regzbot about the issue, as that ensures the regression makes it onto
the radar of the Linux kernel's regression tracker -- that's in your
interest, as it ensures your report won't fall through the cracks unnoticed.

Hint for developers: you normally don't need to care about regzbot once
it's involved. Fix the issue as you normally would; just remember to
include a 'Link:' tag in the patch descriptions pointing to all reports
about the issue. This has been expected from developers even before
regzbot showed up for reasons explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'.

2022-05-06 15:49:25

by Dennis Dalessandro

Subject: Re: NFS regression between 5.17 and 5.18

On 4/29/22 9:37 AM, Chuck Lever III wrote:
>
>
>> On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro <[email protected]> wrote:
>>
>> On 4/28/22 3:56 PM, Trond Myklebust wrote:
>>> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
>>>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
>>>>> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi NFS folks,
>>>>>>>
>>>>>>> I've noticed a pretty nasty regression in our NFS capability
>>>>>>> between 5.17 and
>>>>>>> 5.18-rc1. I've tried to bisect but not having any luck. The
>>>>>>> problem I'm seeing
>>>>>>> is it takes 3 minutes to copy a file from NFS to the local
>>>>>>> disk. When it should
>>>>>>> take less than half a second, which it did up through 5.17.
>>>>>>>
>>>>>>> It doesn't seem to be network related, but can't rule that out
>>>>>>> completely.
>>>>>>>
>>>>>>> I tried to bisect but the problem can be intermittent. Some
>>>>>>> runs I'll see a
>>>>>>> problem in 3 out of 100 cycles, sometimes 0 out of 100.
>>>>>>> Sometimes I'll see it
>>>>>>> 100 out of 100.
>>>>>>
>>>>>> It's not clear from your problem report whether the problem
>>>>>> appears
>>>>>> when it's the server running v5.18-rc or the client.
>>>>>
>>>>> That's because I don't know which it is. I'll do a quick test and
>>>>> find out. I
>>>>> was testing the same kernel across both nodes.
>>>>
>>>> Looks like it is the client.
>>>>
>>>> server  client  result
>>>> ------  ------  ------
>>>> 5.17    5.17    Pass
>>>> 5.17    5.18    Fail
>>>> 5.18    5.18    Fail
>>>> 5.18    5.17    Pass
>>>>
>>>> Is there a patch for the client issue you mentioned that I could try?
>>>>
>>>> -Denny
>>>
>>> Try this one
>>
>> Thanks for the patch. Unfortunately it doesn't seem to solve the issue, still
>> see intermittent hangs. I applied it on top of -rc4:
>>
>> copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
>>
>> real 2m6.072s
>> user 0m0.002s
>> sys 0m0.263s
>> Done
>>
>> While it was hung I checked the mem usage on the machine:
>>
>> # free -h
>>              total        used        free      shared  buff/cache   available
>> Mem:           62Gi       871Mi        61Gi       342Mi       889Mi        61Gi
>> Swap:         4.0Gi          0B       4.0Gi
>>
>> Doesn't appear to be under memory pressure.
>
> Hi, since you know now that it is the client, perhaps a bisect
> would be more successful?

I've been testing all week. I pulled the nfs-rdma tree that was sent to Linus
for 5.18 and tested. I see the problem on pretty much all the patches; however,
it's the frequency at which it hits that changes.

Up to and including "NFS: Convert readdir page cache to use a cookie based
index" I'll see 1-5 cycles out of 2500 where the copy takes minutes.

After that patch I start seeing it around 10 times in 500, and by the last
patch 10 times in fewer than 100.

Is there any kind of tracing/debugging I could turn on to get more insight into
what is taking so long when it does go bad?

-Denny
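
On the tracing question, the sunrpc and rpcrdma tracepoints (or the older
rpcdebug switches) are the usual starting points on the client. A generic
sketch, assuming trace-cmd and rpcdebug are installed; the event selection is
an example, not advice given in the thread:

# Record sunrpc/rpcrdma trace events while running one slow copy
trace-cmd record -e sunrpc -e rpcrdma -o nfs-slow.dat \
    cp /mnt/nfs_test/run_nfs_test.junk /dev/shm/run_nfs_test.tmp
trace-cmd report nfs-slow.dat | less

# Or the older debug switches (very verbose, output goes to dmesg)
rpcdebug -m rpc -s xprt trans call sched
rpcdebug -m nfs -s all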





2022-05-13 14:27:55

by Trond Myklebust

Subject: Re: NFS regression between 5.17 and 5.18

On Fri, 2022-05-13 at 07:58 -0400, Dennis Dalessandro wrote:
> On 5/6/22 9:24 AM, Dennis Dalessandro wrote:
> > On 4/29/22 9:37 AM, Chuck Lever III wrote:
> > >
> > >
> > > > On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro
> > > > <[email protected]> wrote:
> > > >
> > > > On 4/28/22 3:56 PM, Trond Myklebust wrote:
> > > > > On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
> > > > > > On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
> > > > > > > On 4/28/22 10:57 AM, Chuck Lever III wrote:
> > > > > > > > > On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
> > > > > > > > > <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Hi NFS folks,
> > > > > > > > >
> > > > > > > > > I've noticed a pretty nasty regression in our NFS
> > > > > > > > > capability
> > > > > > > > > between 5.17 and
> > > > > > > > > 5.18-rc1. I've tried to bisect but not having any
> > > > > > > > > luck. The
> > > > > > > > > problem I'm seeing
> > > > > > > > > is it takes 3 minutes to copy a file from NFS to the
> > > > > > > > > local
> > > > > > > > > disk. When it should
> > > > > > > > > take less than half a second, which it did up through
> > > > > > > > > 5.17.
> > > > > > > > >
> > > > > > > > > It doesn't seem to be network related, but can't rule
> > > > > > > > > that out
> > > > > > > > > completely.
> > > > > > > > >
> > > > > > > > > I tried to bisect but the problem can be
> > > > > > > > > intermittent. Some
> > > > > > > > > runs I'll see a
> > > > > > > > > problem in 3 out of 100 cycles, sometimes 0 out of
> > > > > > > > > 100.
> > > > > > > > > Sometimes I'll see it
> > > > > > > > > 100 out of 100.
> > > > > > > >
> > > > > > > > It's not clear from your problem report whether the
> > > > > > > > problem
> > > > > > > > appears
> > > > > > > > when it's the server running v5.18-rc or the client.
> > > > > > >
> > > > > > > That's because I don't know which it is. I'll do a quick
> > > > > > > test and
> > > > > > > find out. I
> > > > > > > was testing the same kernel across both nodes.
> > > > > >
> > > > > > Looks like it is the client.
> > > > > >
> > > > > > server  client  result
> > > > > > ------  ------  ------
> > > > > > 5.17    5.17    Pass
> > > > > > 5.17    5.18    Fail
> > > > > > 5.18    5.18    Fail
> > > > > > 5.18    5.17    Pass
> > > > > >
> > > > > > Is there a patch for the client issue you mentioned that I
> > > > > > could try?
> > > > > >
> > > > > > -Denny
> > > > >
> > > > > Try this one
> > > >
> > > > Thanks for the patch. Unfortunately it doesn't seem to solve
> > > > the issue, still
> > > > see intermittent hangs. I applied it on top of -rc4:
> > > >
> > > > copy /mnt/nfs_test/run_nfs_test.junk to
> > > > /dev/shm/run_nfs_test.tmp...
> > > >
> > > > real    2m6.072s
> > > > user    0m0.002s
> > > > sys     0m0.263s
> > > > Done
> > > >
> > > > While it was hung I checked the mem usage on the machine:
> > > >
> > > > # free -h
> > > >              total        used        free      shared 
> > > > buff/cache   available
> > > > Mem:           62Gi       871Mi        61Gi       342Mi      
> > > > 889Mi        61Gi
> > > > Swap:         4.0Gi          0B       4.0Gi
> > > >
> > > > Doesn't appear to be under memory pressure.
> > >
> > > Hi, since you know now that it is the client, perhaps a bisect
> > > would be more successful?
> >
> > I've been testing all week. I pulled the nfs-rdma tree that was
> > sent to Linus
> > for 5.18 and tested. I see the problem on pretty much all the
> > patches. However
> > it's the frequency that it hits which changes.
> >
> > I'll see 1-5 cycles out of 2500 where the copy takes minutes up to:
> > "NFS: Convert readdir page cache to use a cookie based index"
> >
> > After this I start seeing it around 10 times in 500 and by the last
> > patch 10
> > times in less than 100.
> >
> > Is there any kind of tracing/debugging I could turn on to get more
> > insight on
> > what is taking so long when it does go bad?
> >
>
> Ran a test with -rc6 and this time see a hung task trace on the
> console as well
> as an NFS RPC error.
>
> [32719.991175] nfs: RPC call returned error 512
> .
> .
> .
> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more
> than 122 seconds.
> [32933.293543]       Tainted: G S                5.18.0-rc6 #1
> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this
> message.
> [32933.308740] task:kworker/u145:23 state:D stack:    0 pid:886141
> ppid:     2
> flags:0x00004000
> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
> [32933.324524] Call Trace:
> [32933.327347]  <TASK>
> [32933.329785]  __schedule+0x3dd/0x970
> [32933.333783]  schedule+0x41/0xa0
> [32933.337388]  xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
> [32933.343639]  ? prepare_to_wait+0xd0/0xd0
> [32933.348123]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
> [32933.354183]  xprt_release+0x26/0x140 [sunrpc]
> [32933.359168]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
> [32933.365225]  rpc_release_resources_task+0xe/0x50 [sunrpc]
> [32933.371381]  __rpc_execute+0x2c5/0x4e0 [sunrpc]
> [32933.376564]  ? __switch_to_asm+0x42/0x70
> [32933.381046]  ? finish_task_switch+0xb2/0x2c0
> [32933.385918]  rpc_async_schedule+0x29/0x40 [sunrpc]
> [32933.391391]  process_one_work+0x1c8/0x390
> [32933.395975]  worker_thread+0x30/0x360
> [32933.400162]  ? process_one_work+0x390/0x390
> [32933.404931]  kthread+0xd9/0x100
> [32933.408536]  ? kthread_complete_and_exit+0x20/0x20
> [32933.413984]  ret_from_fork+0x22/0x30
> [32933.418074]  </TASK>
>
> The call trace shows up again at 245, 368, and 491 seconds. Same
> task, same trace.
>
>
>
>

That's very helpful. The above trace suggests that the RDMA code is
leaking a call to xprt_unpin_rqst().


From what I can see, rpcrdma_reply_handler() will always call
xprt_pin_rqst(), and that call is required to be matched by a call to
xprt_unpin_rqst() before the handler exits. However those calls to
xprt_unpin_rqst() appear to be scattered around the code, and it is not
at all obvious that the requirement is being fulfilled.

Chuck, why does that code need to be so complex?

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2022-05-13 16:00:21

by Trond Myklebust

Subject: Re: NFS regression between 5.17 and 5.18

On Fri, 2022-05-13 at 14:59 +0000, Chuck Lever III wrote:
>
>
> > On May 13, 2022, at 9:30 AM, Trond Myklebust
> > <[email protected]> wrote:
> >
> > On Fri, 2022-05-13 at 07:58 -0400, Dennis Dalessandro wrote:
> > > On 5/6/22 9:24 AM, Dennis Dalessandro wrote:
> > > > On 4/29/22 9:37 AM, Chuck Lever III wrote:
> > > > >
> > > > >
> > > > > > On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On 4/28/22 3:56 PM, Trond Myklebust wrote:
> > > > > > > On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro
> > > > > > > wrote:
> > > > > > > > On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
> > > > > > > > > On 4/28/22 10:57 AM, Chuck Lever III wrote:
> > > > > > > > > > > On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
> > > > > > > > > > > <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi NFS folks,
> > > > > > > > > > >
> > > > > > > > > > > I've noticed a pretty nasty regression in our NFS
> > > > > > > > > > > capability
> > > > > > > > > > > between 5.17 and
> > > > > > > > > > > 5.18-rc1. I've tried to bisect but not having any
> > > > > > > > > > > luck. The
> > > > > > > > > > > problem I'm seeing
> > > > > > > > > > > is it takes 3 minutes to copy a file from NFS to
> > > > > > > > > > > the
> > > > > > > > > > > local
> > > > > > > > > > > disk. When it should
> > > > > > > > > > > take less than half a second, which it did up
> > > > > > > > > > > through
> > > > > > > > > > > 5.17.
> > > > > > > > > > >
> > > > > > > > > > > It doesn't seem to be network related, but can't
> > > > > > > > > > > rule
> > > > > > > > > > > that out
> > > > > > > > > > > completely.
> > > > > > > > > > >
> > > > > > > > > > > I tried to bisect but the problem can be
> > > > > > > > > > > intermittent. Some
> > > > > > > > > > > runs I'll see a
> > > > > > > > > > > problem in 3 out of 100 cycles, sometimes 0 out
> > > > > > > > > > > of
> > > > > > > > > > > 100.
> > > > > > > > > > > Sometimes I'll see it
> > > > > > > > > > > 100 out of 100.
> > > > > > > > > >
> > > > > > > > > > It's not clear from your problem report whether the
> > > > > > > > > > problem
> > > > > > > > > > appears
> > > > > > > > > > when it's the server running v5.18-rc or the
> > > > > > > > > > client.
> > > > > > > > >
> > > > > > > > > That's because I don't know which it is. I'll do a
> > > > > > > > > quick
> > > > > > > > > test and
> > > > > > > > > find out. I
> > > > > > > > > was testing the same kernel across both nodes.
> > > > > > > >
> > > > > > > > Looks like it is the client.
> > > > > > > >
> > > > > > > > server  client  result
> > > > > > > > ------  ------  ------
> > > > > > > > 5.17    5.17    Pass
> > > > > > > > 5.17    5.18    Fail
> > > > > > > > 5.18    5.18    Fail
> > > > > > > > 5.18    5.17    Pass
> > > > > > > >
> > > > > > > > Is there a patch for the client issue you mentioned
> > > > > > > > that I
> > > > > > > > could try?
> > > > > > > >
> > > > > > > > -Denny
> > > > > > >
> > > > > > > Try this one
> > > > > >
> > > > > > Thanks for the patch. Unfortunately it doesn't seem to
> > > > > > solve
> > > > > > the issue, still
> > > > > > see intermittent hangs. I applied it on top of -rc4:
> > > > > >
> > > > > > copy /mnt/nfs_test/run_nfs_test.junk to
> > > > > > /dev/shm/run_nfs_test.tmp...
> > > > > >
> > > > > > real    2m6.072s
> > > > > > user    0m0.002s
> > > > > > sys     0m0.263s
> > > > > > Done
> > > > > >
> > > > > > While it was hung I checked the mem usage on the machine:
> > > > > >
> > > > > > # free -h
> > > > > >              total        used        free      shared
> > > > > > buff/cache   available
> > > > > > Mem:           62Gi       871Mi        61Gi      
> > > > > > 342Mi     
> > > > > > 889Mi        61Gi
> > > > > > Swap:         4.0Gi          0B       4.0Gi
> > > > > >
> > > > > > Doesn't appear to be under memory pressure.
> > > > >
> > > > > Hi, since you know now that it is the client, perhaps a
> > > > > bisect
> > > > > would be more successful?
> > > >
> > > > I've been testing all week. I pulled the nfs-rdma tree that was
> > > > sent to Linus
> > > > for 5.18 and tested. I see the problem on pretty much all the
> > > > patches. However
> > > > it's the frequency that it hits which changes.
> > > >
> > > > I'll see 1-5 cycles out of 2500 where the copy takes minutes up
> > > > to:
> > > > "NFS: Convert readdir page cache to use a cookie based index"
> > > >
> > > > After this I start seeing it around 10 times in 500 and by the
> > > > last
> > > > patch 10
> > > > times in less than 100.
> > > >
> > > > Is there any kind of tracing/debugging I could turn on to get
> > > > more
> > > > insight on
> > > > what is taking so long when it does go bad?
> > > >
> > >
> > > Ran a test with -rc6 and this time see a hung task trace on the
> > > console as well
> > > as an NFS RPC error.
> > >
> > > [32719.991175] nfs: RPC call returned error 512
> > > .
> > > .
> > > .
> > > [32933.285126] INFO: task kworker/u145:23:886141 blocked for more
> > > than 122 seconds.
> > > [32933.293543]       Tainted: G S                5.18.0-rc6 #1
> > > [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this
> > > message.
> > > [32933.308740] task:kworker/u145:23 state:D stack:    0
> > > pid:886141
> > > ppid:     2
> > > flags:0x00004000
> > > [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
> > > [32933.324524] Call Trace:
> > > [32933.327347]  <TASK>
> > > [32933.329785]  __schedule+0x3dd/0x970
> > > [32933.333783]  schedule+0x41/0xa0
> > > [32933.337388]  xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
> > > [32933.343639]  ? prepare_to_wait+0xd0/0xd0
> > > [32933.348123]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
> > > [32933.354183]  xprt_release+0x26/0x140 [sunrpc]
> > > [32933.359168]  ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
> > > [32933.365225]  rpc_release_resources_task+0xe/0x50 [sunrpc]
> > > [32933.371381]  __rpc_execute+0x2c5/0x4e0 [sunrpc]
> > > [32933.376564]  ? __switch_to_asm+0x42/0x70
> > > [32933.381046]  ? finish_task_switch+0xb2/0x2c0
> > > [32933.385918]  rpc_async_schedule+0x29/0x40 [sunrpc]
> > > [32933.391391]  process_one_work+0x1c8/0x390
> > > [32933.395975]  worker_thread+0x30/0x360
> > > [32933.400162]  ? process_one_work+0x390/0x390
> > > [32933.404931]  kthread+0xd9/0x100
> > > [32933.408536]  ? kthread_complete_and_exit+0x20/0x20
> > > [32933.413984]  ret_from_fork+0x22/0x30
> > > [32933.418074]  </TASK>
> > >
> > > The call trace shows up again at 245, 368, and 491 seconds. Same
> > > task, same trace.
> > >
> > >
> > >
> > >
> >
> > That's very helpful. The above trace suggests that the RDMA code is
> > leaking a call to xprt_unpin_rqst().
>
> IMHO this is unlikely to be related to the performance
> regression -- none of this code has changed in the past 5
> kernel releases. Could be a different issue, though.
>
> As is often the case in these situations, the INFO trace
> above happens long after the issue that caused the missing
> unpin. So... unless Dennis has a reproducer that can trigger
> the issue frequently, I don't think there's much that can
> be extracted from that.
>
> Also "nfs: RPC call returned error 512" suggests someone
> hit ^C at some point. It's always possible that the
> xprt_rdma_free() path is missing an unpin. But again,
> that's not likely to be related to performance.
>
>
> > From what I can see, rpcrdma_reply_handler() will always call
> > xprt_pin_rqst(), and that call is required to be matched by a call
> > to
> > xprt_unpin_rqst() before the handler exits. However those calls to
> > xprt_unpin_rqst() appear to be scattered around the code, and it is
> > not
> > at all obvious that the requirement is being fulfilled.
>
> I don't think unpin is required before rpcrdma_reply_handler()
> exits. It /is/ required after xprt_complete_rqst(), and
> sometimes that call has to be deferred until the NIC is
> fully done with the RPC's buffers. The NIC is often not
> finished with the buffers in the Receive completion handler.
>
>
> > Chuck, why does that code need to be so complex?
>
> The overall goal is that the upper layer cannot be awoken
> until the NIC has finished with the RPC's header and
> payload buffers. These buffers are treated differently
> depending on how the send path chose to marshal the RPC.
>
> And, all of these are hot paths, so it's important to
> ensure that we are waiting/context-switching as
> infrequently as possible. We wake the upper layer waiter
> during an RDMA completion to reduce context switching.
> The first completion does not always mean there's nothing
> left for the NIC to do.
>
> Ignoring the server backchannel, there is a single unpin
> in the "complete without error" case in
> rpcrdma_complete_rqst().
>
> There are three call sites for rpcrdma_complete_rqst:
>
> 1. rpcrdma_reply_done - called when no explicit invalidation
> is needed and the Receive completion fired after the Send
> completion.
>
> 2. rpcrdma_sendctx_done - called when no explicit invalidation
> is needed and the Send completion fired after the Receive
> completion.
>
> 3. frwr_wc_localinv_done - called when explicit invalidation
> was needed and that invalidation has completed. SQ ordering
> guarantees that the Send has also completed at this point,
> so a separate wait for Send completion is not needed.
>
> These are the points when the NIC has finished with all of
> the memory resources associated with an RPC, so that's
> when it is safe to wake the waiter and unpin the RPC.
>

From what I can see, what's important in the case of RDMA is to ensure
that we call exactly one of frwr_unmap_async(), frwr_unmap_sync()
or rpcrdma_complete_rqst(), and that whichever operation we select must
complete before we call xprt->ops->free_slot().

xprt_pin_rqst()/xprt_unpin_rqst() can help to ensure exclusion between
those 3 operations, but it is not required if there are other ways to
do so.
In fact, since xprt_wait_on_pinned_rqst() forces a synchronous sleep in
xprt_release(), it would be much preferable to find a better solution
than to pin the request if your concern is performance.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2022-05-14 01:00:20

by Dennis Dalessandro

Subject: Re: NFS regression between 5.17 and 5.18

On 5/6/22 9:24 AM, Dennis Dalessandro wrote:
> On 4/29/22 9:37 AM, Chuck Lever III wrote:
>>
>>
>>> On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro <[email protected]> wrote:
>>>
>>> On 4/28/22 3:56 PM, Trond Myklebust wrote:
>>>> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
>>>>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
>>>>>> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>>>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi NFS folks,
>>>>>>>>
>>>>>>>> I've noticed a pretty nasty regression in our NFS capability
>>>>>>>> between 5.17 and
>>>>>>>> 5.18-rc1. I've tried to bisect but not having any luck. The
>>>>>>>> problem I'm seeing
>>>>>>>> is it takes 3 minutes to copy a file from NFS to the local
>>>>>>>> disk. When it should
>>>>>>>> take less than half a second, which it did up through 5.17.
>>>>>>>>
>>>>>>>> It doesn't seem to be network related, but can't rule that out
>>>>>>>> completely.
>>>>>>>>
>>>>>>>> I tried to bisect but the problem can be intermittent. Some
>>>>>>>> runs I'll see a
>>>>>>>> problem in 3 out of 100 cycles, sometimes 0 out of 100.
>>>>>>>> Sometimes I'll see it
>>>>>>>> 100 out of 100.
>>>>>>>
>>>>>>> It's not clear from your problem report whether the problem
>>>>>>> appears
>>>>>>> when it's the server running v5.18-rc or the client.
>>>>>>
>>>>>> That's because I don't know which it is. I'll do a quick test and
>>>>>> find out. I
>>>>>> was testing the same kernel across both nodes.
>>>>>
>>>>> Looks like it is the client.
>>>>>
>>>>> server  client  result
>>>>> ------  ------  ------
>>>>> 5.17    5.17    Pass
>>>>> 5.17    5.18    Fail
>>>>> 5.18    5.18    Fail
>>>>> 5.18    5.17    Pass
>>>>>
>>>>> Is there a patch for the client issue you mentioned that I could try?
>>>>>
>>>>> -Denny
>>>>
>>>> Try this one
>>>
>>> Thanks for the patch. Unfortunately it doesn't seem to solve the issue, still
>>> see intermittent hangs. I applied it on top of -rc4:
>>>
>>> copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
>>>
>>> real 2m6.072s
>>> user 0m0.002s
>>> sys 0m0.263s
>>> Done
>>>
>>> While it was hung I checked the mem usage on the machine:
>>>
>>> # free -h
>>>              total        used        free      shared  buff/cache   available
>>> Mem:           62Gi       871Mi        61Gi       342Mi       889Mi        61Gi
>>> Swap:         4.0Gi          0B       4.0Gi
>>>
>>> Doesn't appear to be under memory pressure.
>>
>> Hi, since you know now that it is the client, perhaps a bisect
>> would be more successful?
>
> I've been testing all week. I pulled the nfs-rdma tree that was sent to Linus
> for 5.18 and tested. I see the problem on pretty much all the patches. However
> it's the frequency that it hits which changes.
>
> I'll see 1-5 cycles out of 2500 where the copy takes minutes up to:
> "NFS: Convert readdir page cache to use a cookie based index"
>
> After this I start seeing it around 10 times in 500 and by the last patch 10
> times in less than 100.
>
> Is there any kind of tracing/debugging I could turn on to get more insight on
> what is taking so long when it does go bad?
>

Ran a test with -rc6 and this time see a hung task trace on the console as well
as an NFS RPC error.

[32719.991175] nfs: RPC call returned error 512
.
.
.
[32933.285126] INFO: task kworker/u145:23:886141 blocked for more than 122 seconds.
[32933.293543] Tainted: G S 5.18.0-rc6 #1
[32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[32933.308740] task:kworker/u145:23 state:D stack: 0 pid:886141 ppid: 2
flags:0x00004000
[32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
[32933.324524] Call Trace:
[32933.327347] <TASK>
[32933.329785] __schedule+0x3dd/0x970
[32933.333783] schedule+0x41/0xa0
[32933.337388] xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
[32933.343639] ? prepare_to_wait+0xd0/0xd0
[32933.348123] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
[32933.354183] xprt_release+0x26/0x140 [sunrpc]
[32933.359168] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
[32933.365225] rpc_release_resources_task+0xe/0x50 [sunrpc]
[32933.371381] __rpc_execute+0x2c5/0x4e0 [sunrpc]
[32933.376564] ? __switch_to_asm+0x42/0x70
[32933.381046] ? finish_task_switch+0xb2/0x2c0
[32933.385918] rpc_async_schedule+0x29/0x40 [sunrpc]
[32933.391391] process_one_work+0x1c8/0x390
[32933.395975] worker_thread+0x30/0x360
[32933.400162] ? process_one_work+0x390/0x390
[32933.404931] kthread+0xd9/0x100
[32933.408536] ? kthread_complete_and_exit+0x20/0x20
[32933.413984] ret_from_fork+0x22/0x30
[32933.418074] </TASK>

The call trace shows up again at 245, 368, and 491 seconds. Same task, same trace.
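
When the copy is stuck like this, the blocked-task dump can also be triggered
on demand and the outstanding RPC tasks inspected through the sunrpc debugfs
files. A generic sketch, assuming sysrq is enabled and debugfs is mounted at
/sys/kernel/debug:

# Dump all blocked (D-state) tasks to dmesg right now
echo w > /proc/sysrq-trigger
# List in-flight RPC tasks per client (task id, flags, proc, status)
cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks
# Transport state for the RDMA mount
cat /sys/kernel/debug/sunrpc/rpc_xprt/*/info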





2022-05-14 02:59:25

by Chuck Lever

Subject: Re: NFS regression between 5.17 and 5.18



> On May 13, 2022, at 11:19 AM, Trond Myklebust <[email protected]> wrote:
>
> On Fri, 2022-05-13 at 14:59 +0000, Chuck Lever III wrote:
>>
>>
>>> On May 13, 2022, at 9:30 AM, Trond Myklebust
>>> <[email protected]> wrote:
>>>
>>> On Fri, 2022-05-13 at 07:58 -0400, Dennis Dalessandro wrote:
>>>> On 5/6/22 9:24 AM, Dennis Dalessandro wrote:
>>>>> On 4/29/22 9:37 AM, Chuck Lever III wrote:
>>>>>>
>>>>>>
>>>>>>> On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> On 4/28/22 3:56 PM, Trond Myklebust wrote:
>>>>>>>> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro
>>>>>>>> wrote:
>>>>>>>>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
>>>>>>>>>> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>>>>>>>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi NFS folks,
>>>>>>>>>>>>
>>>>>>>>>>>> I've noticed a pretty nasty regression in our NFS
>>>>>>>>>>>> capability
>>>>>>>>>>>> between 5.17 and
>>>>>>>>>>>> 5.18-rc1. I've tried to bisect but not having any
>>>>>>>>>>>> luck. The
>>>>>>>>>>>> problem I'm seeing
>>>>>>>>>>>> is it takes 3 minutes to copy a file from NFS to
>>>>>>>>>>>> the
>>>>>>>>>>>> local
>>>>>>>>>>>> disk. When it should
>>>>>>>>>>>> take less than half a second, which it did up
>>>>>>>>>>>> through
>>>>>>>>>>>> 5.17.
>>>>>>>>>>>>
>>>>>>>>>>>> It doesn't seem to be network related, but can't
>>>>>>>>>>>> rule
>>>>>>>>>>>> that out
>>>>>>>>>>>> completely.
>>>>>>>>>>>>
>>>>>>>>>>>> I tried to bisect but the problem can be
>>>>>>>>>>>> intermittent. Some
>>>>>>>>>>>> runs I'll see a
>>>>>>>>>>>> problem in 3 out of 100 cycles, sometimes 0 out
>>>>>>>>>>>> of
>>>>>>>>>>>> 100.
>>>>>>>>>>>> Sometimes I'll see it
>>>>>>>>>>>> 100 out of 100.
>>>>>>>>>>>
>>>>>>>>>>> It's not clear from your problem report whether the
>>>>>>>>>>> problem
>>>>>>>>>>> appears
>>>>>>>>>>> when it's the server running v5.18-rc or the
>>>>>>>>>>> client.
>>>>>>>>>>
>>>>>>>>>> That's because I don't know which it is. I'll do a
>>>>>>>>>> quick
>>>>>>>>>> test and
>>>>>>>>>> find out. I
>>>>>>>>>> was testing the same kernel across both nodes.
>>>>>>>>>
>>>>>>>>> Looks like it is the client.
>>>>>>>>>
>>>>>>>>> server  client  result
>>>>>>>>> ------  ------  ------
>>>>>>>>> 5.17    5.17    Pass
>>>>>>>>> 5.17    5.18    Fail
>>>>>>>>> 5.18    5.18    Fail
>>>>>>>>> 5.18    5.17    Pass
>>>>>>>>>
>>>>>>>>> Is there a patch for the client issue you mentioned
>>>>>>>>> that I
>>>>>>>>> could try?
>>>>>>>>>
>>>>>>>>> -Denny
>>>>>>>>
>>>>>>>> Try this one
>>>>>>>
>>>>>>> Thanks for the patch. Unfortunately it doesn't seem to
>>>>>>> solve
>>>>>>> the issue, still
>>>>>>> see intermittent hangs. I applied it on top of -rc4:
>>>>>>>
>>>>>>> copy /mnt/nfs_test/run_nfs_test.junk to
>>>>>>> /dev/shm/run_nfs_test.tmp...
>>>>>>>
>>>>>>> real 2m6.072s
>>>>>>> user 0m0.002s
>>>>>>> sys 0m0.263s
>>>>>>> Done
>>>>>>>
>>>>>>> While it was hung I checked the mem usage on the machine:
>>>>>>>
>>>>>>> # free -h
>>>>>>> total used free shared
>>>>>>> buff/cache available
>>>>>>> Mem: 62Gi 871Mi 61Gi
>>>>>>> 342Mi
>>>>>>> 889Mi 61Gi
>>>>>>> Swap: 4.0Gi 0B 4.0Gi
>>>>>>>
>>>>>>> Doesn't appear to be under memory pressure.
>>>>>>
>>>>>> Hi, since you know now that it is the client, perhaps a
>>>>>> bisect
>>>>>> would be more successful?
>>>>>
>>>>> I've been testing all week. I pulled the nfs-rdma tree that was
>>>>> sent to Linus
>>>>> for 5.18 and tested. I see the problem on pretty much all the
>>>>> patches. However
>>>>> it's the frequency that it hits which changes.
>>>>>
>>>>> I'll see 1-5 cycles out of 2500 where the copy takes minutes up
>>>>> to:
>>>>> "NFS: Convert readdir page cache to use a cookie based index"
>>>>>
>>>>> After this I start seeing it around 10 times in 500 and by the
>>>>> last
>>>>> patch 10
>>>>> times in less than 100.
>>>>>
>>>>> Is there any kind of tracing/debugging I could turn on to get
>>>>> more
>>>>> insight on
>>>>> what is taking so long when it does go bad?
>>>>>
>>>>
>>>> Ran a test with -rc6 and this time see a hung task trace on the
>>>> console as well
>>>> as an NFS RPC error.
>>>>
>>>> [32719.991175] nfs: RPC call returned error 512
>>>> .
>>>> .
>>>> .
>>>> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more
>>>> than 122 seconds.
>>>> [32933.293543] Tainted: G S 5.18.0-rc6 #1
>>>> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>> disables this
>>>> message.
>>>> [32933.308740] task:kworker/u145:23 state:D stack: 0
>>>> pid:886141
>>>> ppid: 2
>>>> flags:0x00004000
>>>> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
>>>> [32933.324524] Call Trace:
>>>> [32933.327347] <TASK>
>>>> [32933.329785] __schedule+0x3dd/0x970
>>>> [32933.333783] schedule+0x41/0xa0
>>>> [32933.337388] xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
>>>> [32933.343639] ? prepare_to_wait+0xd0/0xd0
>>>> [32933.348123] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>>>> [32933.354183] xprt_release+0x26/0x140 [sunrpc]
>>>> [32933.359168] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>>>> [32933.365225] rpc_release_resources_task+0xe/0x50 [sunrpc]
>>>> [32933.371381] __rpc_execute+0x2c5/0x4e0 [sunrpc]
>>>> [32933.376564] ? __switch_to_asm+0x42/0x70
>>>> [32933.381046] ? finish_task_switch+0xb2/0x2c0
>>>> [32933.385918] rpc_async_schedule+0x29/0x40 [sunrpc]
>>>> [32933.391391] process_one_work+0x1c8/0x390
>>>> [32933.395975] worker_thread+0x30/0x360
>>>> [32933.400162] ? process_one_work+0x390/0x390
>>>> [32933.404931] kthread+0xd9/0x100
>>>> [32933.408536] ? kthread_complete_and_exit+0x20/0x20
>>>> [32933.413984] ret_from_fork+0x22/0x30
>>>> [32933.418074] </TASK>
>>>>
>>>> The call trace shows up again at 245, 368, and 491 seconds. Same
>>>> task, same trace.
>>>>
>>>>
>>>>
>>>>
>>>
>>> That's very helpful. The above trace suggests that the RDMA code is
>>> leaking a call to xprt_unpin_rqst().
>>
>> IMHO this is unlikely to be related to the performance
>> regression -- none of this code has changed in the past 5
>> kernel releases. Could be a different issue, though.
>>
>> As is often the case in these situations, the INFO trace
>> above happens long after the issue that caused the missing
>> unpin. So... unless Dennis has a reproducer that can trigger
>> the issue frequently, I don't think there's much that can
>> be extracted from that.
>>
>> Also "nfs: RPC call returned error 512" suggests someone
>> hit ^C at some point. It's always possible that the
>> xprt_rdma_free() path is missing an unpin. But again,
>> that's not likely to be related to performance.
>>
>>
>>> From what I can see, rpcrdma_reply_handler() will always call
>>> xprt_pin_rqst(), and that call is required to be matched by a call
>>> to
>>> xprt_unpin_rqst() before the handler exits. However those calls to
>>> xprt_unpin_rqst() appear to be scattered around the code, and it is
>>> not
>>> at all obvious that the requirement is being fulfilled.
>>
>> I don't think unpin is required before rpcrdma_reply_handler()
>> exits. It /is/ required after xprt_complete_rqst(), and
>> sometimes that call has to be deferred until the NIC is
>> fully done with the RPC's buffers. The NIC is often not
>> finished with the buffers in the Receive completion handler.
>>
>>
>>> Chuck, why does that code need to be so complex?
>>
>> The overall goal is that the upper layer cannot be awoken
>> until the NIC has finished with the RPC's header and
>> payload buffers. These buffers are treated differently
>> depending on how the send path chose to marshal the RPC.
>>
>> And, all of these are hot paths, so it's important to
>> ensure that we are waiting/context-switching as
>> infrequently as possible. We wake the upper layer waiter
>> during an RDMA completion to reduce context switching.
>> The first completion does not always mean there's nothing
>> left for the NIC to do.
>>
>> Ignoring the server backchannel, there is a single unpin
>> in the "complete without error" case in
>> rpcrdma_complete_rqst().
>>
>> There are three call sites for rpcrdma_complete_rqst:
>>
>> 1. rpcrdma_reply_done - called when no explicit invalidation
>> is needed and the Receive completion fired after the Send
>> completion.
>>
>> 2. rpcrdma_sendctx_done - called when no explicit invalidation
>> is needed and the Send completion fired after the Receive
>> completion.
>>
>> 3. frwr_wc_localinv_done - called when explicit invalidation
>> was needed and that invalidation has completed. SQ ordering
>> guarantees that the Send has also completed at this point,
>> so a separate wait for Send completion is not needed.
>>
>> These are the points when the NIC has finished with all of
>> the memory resources associated with an RPC, so that's
>> when it is safe to wake the waiter and unpin the RPC.
>>
>
> From what I can see, what's important in the case of RDMA is to ensure
> that we call exactly 1 of either frwr_unmap_async(), frwr_unmap_sync()
> or rpcrdma_complete_rqst(),

No, all of these mechanisms are supposed to end up at
rpcrdma_complete_rqst(). That's where the RPC/RDMA transport
header is parsed.

Reply completion processing is kicked off by a Receive completion.
If local invalidation is unneeded, then we have to wait explicitly
for the Send completion too. Otherwise, a LOCAL_INV is posted and
we wait for that to complete; that completion means the previous
Send is complete, so waiting for Send completion in that case is
unnecessary.

When all of that dust settles, rpcrdma_complete_rqst() is called
to wake up the upper layer. That's basically how xprtsock did it
when I wrote all this years ago.

(The Send completion is needed because the NIC is free to continue
to read from sndbuf until then. We can't release that buffer until
completion because otherwise a NIC retransmission could expose
the different contents on the wire, or worse, the receiving end
would see garbage. Works a bit like TCP retransmission).


> and that whichever operation we select must
> complete before we call xprt->ops->free_slot().

The NIC has to tell xprtrdma that its memory operations are
complete before RPC can XDR decode either the RPC header or
the NFS request itself. That's well before ->free_slot(), IIRC.


> xprt_pin_rqst()/xprt_unpin_rqst() can help to ensure exclusion between
> those 3 operations, but it is not required if there are other ways to
> do so.
> In fact since xprt_wait_on_pinned_rqst() forces a synchronous sleep in
> xprt_release(), it would be much preferable to find a better solution
> than to pin the request if you concern is performance.

In particular, my performance concern is to avoid waits and
context switches, because those add latency, sometimes quite a
bit. I've tried to build the completion process on only the
unavoidable context switches. I don't immediately see how to
do this by adding wait_on_pinned to the fray.

frwr_unmap_sync() is used only to deal with ^C, so it can be
ugly and/or slow.

I am open to discussion about the structure of this code. The
simpler the better, but there are lots of moving parts. So
feedback is always welcome.


--
Chuck Lever




2022-05-14 04:15:37

by Chuck Lever

Subject: Re: NFS regression between 5.17 and 5.18



> On May 13, 2022, at 9:30 AM, Trond Myklebust <[email protected]> wrote:
>
> On Fri, 2022-05-13 at 07:58 -0400, Dennis Dalessandro wrote:
>> On 5/6/22 9:24 AM, Dennis Dalessandro wrote:
>>> On 4/29/22 9:37 AM, Chuck Lever III wrote:
>>>>
>>>>
>>>>> On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro
>>>>> <[email protected]> wrote:
>>>>>
>>>>> On 4/28/22 3:56 PM, Trond Myklebust wrote:
>>>>>> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
>>>>>>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
>>>>>>>> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>>>>>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi NFS folks,
>>>>>>>>>>
>>>>>>>>>> I've noticed a pretty nasty regression in our NFS
>>>>>>>>>> capability
>>>>>>>>>> between 5.17 and
>>>>>>>>>> 5.18-rc1. I've tried to bisect but not having any
>>>>>>>>>> luck. The
>>>>>>>>>> problem I'm seeing
>>>>>>>>>> is it takes 3 minutes to copy a file from NFS to the
>>>>>>>>>> local
>>>>>>>>>> disk. When it should
>>>>>>>>>> take less than half a second, which it did up through
>>>>>>>>>> 5.17.
>>>>>>>>>>
>>>>>>>>>> It doesn't seem to be network related, but can't rule
>>>>>>>>>> that out
>>>>>>>>>> completely.
>>>>>>>>>>
>>>>>>>>>> I tried to bisect but the problem can be
>>>>>>>>>> intermittent. Some
>>>>>>>>>> runs I'll see a
>>>>>>>>>> problem in 3 out of 100 cycles, sometimes 0 out of
>>>>>>>>>> 100.
>>>>>>>>>> Sometimes I'll see it
>>>>>>>>>> 100 out of 100.
>>>>>>>>>
>>>>>>>>> It's not clear from your problem report whether the
>>>>>>>>> problem
>>>>>>>>> appears
>>>>>>>>> when it's the server running v5.18-rc or the client.
>>>>>>>>
>>>>>>>> That's because I don't know which it is. I'll do a quick
>>>>>>>> test and
>>>>>>>> find out. I
>>>>>>>> was testing the same kernel across both nodes.
>>>>>>>
>>>>>>> Looks like it is the client.
>>>>>>>
>>>>>>> server  client  result
>>>>>>> ------  ------  ------
>>>>>>> 5.17    5.17    Pass
>>>>>>> 5.17    5.18    Fail
>>>>>>> 5.18    5.18    Fail
>>>>>>> 5.18    5.17    Pass
>>>>>>>
>>>>>>> Is there a patch for the client issue you mentioned that I
>>>>>>> could try?
>>>>>>>
>>>>>>> -Denny
>>>>>>
>>>>>> Try this one
>>>>>
>>>>> Thanks for the patch. Unfortunately it doesn't seem to solve
>>>>> the issue, still
>>>>> see intermittent hangs. I applied it on top of -rc4:
>>>>>
>>>>> copy /mnt/nfs_test/run_nfs_test.junk to
>>>>> /dev/shm/run_nfs_test.tmp...
>>>>>
>>>>> real 2m6.072s
>>>>> user 0m0.002s
>>>>> sys 0m0.263s
>>>>> Done
>>>>>
>>>>> While it was hung I checked the mem usage on the machine:
>>>>>
>>>>> # free -h
>>>>> total used free shared
>>>>> buff/cache available
>>>>> Mem: 62Gi 871Mi 61Gi 342Mi
>>>>> 889Mi 61Gi
>>>>> Swap: 4.0Gi 0B 4.0Gi
>>>>>
>>>>> Doesn't appear to be under memory pressure.
>>>>
>>>> Hi, since you know now that it is the client, perhaps a bisect
>>>> would be more successful?
>>>
>>> I've been testing all week. I pulled the nfs-rdma tree that was
>>> sent to Linus
>>> for 5.18 and tested. I see the problem on pretty much all the
>>> patches. However
>>> it's the frequency that it hits which changes.
>>>
>>> I'll see 1-5 cycles out of 2500 where the copy takes minutes up to:
>>> "NFS: Convert readdir page cache to use a cookie based index"
>>>
>>> After this I start seeing it around 10 times in 500 and by the last
>>> patch 10
>>> times in less than 100.
>>>
>>> Is there any kind of tracing/debugging I could turn on to get more
>>> insight on
>>> what is taking so long when it does go bad?
>>>
>>
>> Ran a test with -rc6 and this time see a hung task trace on the
>> console as well
>> as an NFS RPC error.
>>
>> [32719.991175] nfs: RPC call returned error 512
>> .
>> .
>> .
>> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more
>> than 122 seconds.
>> [32933.293543] Tainted: G S 5.18.0-rc6 #1
>> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this
>> message.
>> [32933.308740] task:kworker/u145:23 state:D stack: 0 pid:886141
>> ppid: 2
>> flags:0x00004000
>> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
>> [32933.324524] Call Trace:
>> [32933.327347] <TASK>
>> [32933.329785] __schedule+0x3dd/0x970
>> [32933.333783] schedule+0x41/0xa0
>> [32933.337388] xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
>> [32933.343639] ? prepare_to_wait+0xd0/0xd0
>> [32933.348123] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>> [32933.354183] xprt_release+0x26/0x140 [sunrpc]
>> [32933.359168] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>> [32933.365225] rpc_release_resources_task+0xe/0x50 [sunrpc]
>> [32933.371381] __rpc_execute+0x2c5/0x4e0 [sunrpc]
>> [32933.376564] ? __switch_to_asm+0x42/0x70
>> [32933.381046] ? finish_task_switch+0xb2/0x2c0
>> [32933.385918] rpc_async_schedule+0x29/0x40 [sunrpc]
>> [32933.391391] process_one_work+0x1c8/0x390
>> [32933.395975] worker_thread+0x30/0x360
>> [32933.400162] ? process_one_work+0x390/0x390
>> [32933.404931] kthread+0xd9/0x100
>> [32933.408536] ? kthread_complete_and_exit+0x20/0x20
>> [32933.413984] ret_from_fork+0x22/0x30
>> [32933.418074] </TASK>
>>
>> The call trace shows up again at 245, 368, and 491 seconds. Same
>> task, same trace.
>>
>>
>>
>>
>
> That's very helpful. The above trace suggests that the RDMA code is
> leaking a call to xprt_unpin_rqst().

IMHO this is unlikely to be related to the performance
regression -- none of this code has changed in the past 5
kernel releases. Could be a different issue, though.

As is often the case in these situations, the INFO trace
above happens long after the issue that caused the missing
unpin. So... unless Dennis has a reproducer that can trigger
the issue frequently, I don't think there's much that can
be extracted from that.

Also "nfs: RPC call returned error 512" suggests someone
hit ^C at some point. It's always possible that the
xprt_rdma_free() path is missing an unpin. But again,
that's not likely to be related to performance.


> From what I can see, rpcrdma_reply_handler() will always call
> xprt_pin_rqst(), and that call is required to be matched by a call to
> xprt_unpin_rqst() before the handler exits. However those calls to
> xprt_unpin_rqst() appear to be scattered around the code, and it is not
> at all obvious that the requirement is being fulfilled.

I don't think unpin is required before rpcrdma_reply_handler()
exits. It /is/ required after xprt_complete_rqst(), and
sometimes that call has to be deferred until the NIC is
fully done with the RPC's buffers. The NIC is often not
finished with the buffers in the Receive completion handler.


> Chuck, why does that code need to be so complex?

The overall goal is that the upper layer cannot be awoken
until the NIC has finished with the RPC's header and
payload buffers. These buffers are treated differently
depending on how the send path chose to marshal the RPC.

And, all of these are hot paths, so it's important to
ensure that we are waiting/context-switching as
infrequently as possible. We wake the upper layer waiter
during an RDMA completion to reduce context switching.
The first completion does not always mean there's nothing
left for the NIC to do.

Ignoring the server backchannel, there is a single unpin
in the "complete without error" case in
rpcrdma_complete_rqst().

There are three call sites for rpcrdma_complete_rqst:

1. rpcrdma_reply_done - called when no explicit invalidation
is needed and the Receive completion fired after the Send
completion.

2. rpcrdma_sendctx_done - called when no explicit invalidation
is needed and the Send completion fired after the Receive
completion.

3. frwr_wc_localinv_done - called when explicit invalidation
was needed and that invalidation has completed. SQ ordering
guarantees that the Send has also completed at this point,
so a separate wait for Send completion is not needed.

These are the points when the NIC has finished with all of
the memory resources associated with an RPC, so that's
when it is safe to wake the waiter and unpin the RPC.
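As a rough way to watch those paths from user space, the rpcrdma and
sunrpc trace events could be captured around a single slow copy. This is
only a sketch -- the tracefs mount point and the exact event names vary
by kernel version:

  TRACEFS=/sys/kernel/tracing   # may be /sys/kernel/debug/tracing

  echo 0 > $TRACEFS/tracing_on
  echo > $TRACEFS/trace                     # clear the ring buffer
  echo 1 > $TRACEFS/events/rpcrdma/enable   # completion/marshaling events
  echo 1 > $TRACEFS/events/sunrpc/enable    # RPC scheduler context
  echo 1 > $TRACEFS/tracing_on

  time cp /mnt/nfs_test/run_nfs_test.junk /dev/shm/run_nfs_test.tmp

  echo 0 > $TRACEFS/tracing_on
  cp $TRACEFS/trace /tmp/rpcrdma-trace.txt

The timestamps in the saved trace should show whether the stall sits
between the last completion and the wake-up, or somewhere else entirely.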

--
Chuck Lever




2022-05-17 16:52:44

by Chuck Lever

[permalink] [raw]
Subject: Re: NFS regression between 5.17 and 5.18



> On May 17, 2022, at 9:40 AM, Dennis Dalessandro <[email protected]> wrote:
>
> On 5/13/22 10:59 AM, Chuck Lever III wrote:
>>>>
>>>> Ran a test with -rc6 and this time see a hung task trace on the
>>>> console as well
>>>> as an NFS RPC error.
>>>>
>>>> [32719.991175] nfs: RPC call returned error 512
>>>> .
>>>> .
>>>> .
>>>> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more
>>>> than 122 seconds.
>>>> [32933.293543] Tainted: G S 5.18.0-rc6 #1
>>>> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>> disables this
>>>> message.
>>>> [32933.308740] task:kworker/u145:23 state:D stack: 0 pid:886141
>>>> ppid: 2
>>>> flags:0x00004000
>>>> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
>>>> [32933.324524] Call Trace:
>>>> [32933.327347] <TASK>
>>>> [32933.329785] __schedule+0x3dd/0x970
>>>> [32933.333783] schedule+0x41/0xa0
>>>> [32933.337388] xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
>>>> [32933.343639] ? prepare_to_wait+0xd0/0xd0
>>>> [32933.348123] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>>>> [32933.354183] xprt_release+0x26/0x140 [sunrpc]
>>>> [32933.359168] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>>>> [32933.365225] rpc_release_resources_task+0xe/0x50 [sunrpc]
>>>> [32933.371381] __rpc_execute+0x2c5/0x4e0 [sunrpc]
>>>> [32933.376564] ? __switch_to_asm+0x42/0x70
>>>> [32933.381046] ? finish_task_switch+0xb2/0x2c0
>>>> [32933.385918] rpc_async_schedule+0x29/0x40 [sunrpc]
>>>> [32933.391391] process_one_work+0x1c8/0x390
>>>> [32933.395975] worker_thread+0x30/0x360
>>>> [32933.400162] ? process_one_work+0x390/0x390
>>>> [32933.404931] kthread+0xd9/0x100
>>>> [32933.408536] ? kthread_complete_and_exit+0x20/0x20
>>>> [32933.413984] ret_from_fork+0x22/0x30
>>>> [32933.418074] </TASK>
>>>>
>>>> The call trace shows up again at 245, 368, and 491 seconds. Same
>>>> task, same trace.
>>>>
>>>>
>>>>
>>>>
>>>
>>> That's very helpful. The above trace suggests that the RDMA code is
>>> leaking a call to xprt_unpin_rqst().
>>
>> IMHO this is unlikely to be related to the performance
>> regression -- none of this code has changed in the past 5
>> kernel releases. Could be a different issue, though.
>>
>> As is often the case in these situations, the INFO trace
>> above happens long after the issue that caused the missing
>> unpin. So... unless Dennis has a reproducer that can trigger
>> the issue frequently, I don't think there's much that can
>> be extracted from that.
>
> To be fair, I've only seen this one time and have had the performance regression
> since -rc1.
>
>> Also "nfs: RPC call returned error 512" suggests someone
>> hit ^C at some point. It's always possible that the
>> xprt_rdma_free() path is missing an unpin. But again,
>> that's not likely to be related to performance.
>
> I've checked our test code and after 10 minutes it does give up trying to do the
> NFS copies and aborts (SIG_INT) the test.

After sleeping on it, I'm fairly certain the stack trace
above is a result of a gap in how xprtrdma handles a
signaled RPC.

Signal handling in that code is pretty hard to test, so not
surprising that there's a lingering bug or two. One idea I
had was to add a fault injector in the RPC scheduler to
throw signals at random. I think it can be done without
perturbing the hot path. Maybe I'll post an RFC patch.


> So in all my tests and bisect attempts it seems the possibility to hit a slow
> NFS operation that hangs for minutes has been possible for quite some time.
> However in 5.18 it gets much worse.
>
> Any likely places I should add traces to try and find out what's stuck or taking
> time?

There's been a lot of churn in that area in recent releases,
so I'm not familiar with the existing tracepoints. Maybe
Ben or Trond could provide some guidance.
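For what it's worth, the tracepoints that do exist on a given kernel can
be listed straight from tracefs. A quick sketch (the mount point may be
/sys/kernel/debug/tracing on older distros):

  ls /sys/kernel/tracing/events/ | grep -E 'nfs|rpc'
  grep -E '^(nfs4|sunrpc|rpcrdma):' /sys/kernel/tracing/available_events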


--
Chuck Lever




2022-05-17 21:27:01

by Dennis Dalessandro

[permalink] [raw]
Subject: Re: NFS regression between 5.17 and 5.18

On 5/13/22 10:59 AM, Chuck Lever III wrote:
>>>
>>> Ran a test with -rc6 and this time see a hung task trace on the
>>> console as well
>>> as an NFS RPC error.
>>>
>>> [32719.991175] nfs: RPC call returned error 512
>>> .
>>> .
>>> .
>>> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more
>>> than 122 seconds.
>>> [32933.293543] Tainted: G S 5.18.0-rc6 #1
>>> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>> disables this
>>> message.
>>> [32933.308740] task:kworker/u145:23 state:D stack: 0 pid:886141
>>> ppid: 2
>>> flags:0x00004000
>>> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
>>> [32933.324524] Call Trace:
>>> [32933.327347] <TASK>
>>> [32933.329785] __schedule+0x3dd/0x970
>>> [32933.333783] schedule+0x41/0xa0
>>> [32933.337388] xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
>>> [32933.343639] ? prepare_to_wait+0xd0/0xd0
>>> [32933.348123] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>>> [32933.354183] xprt_release+0x26/0x140 [sunrpc]
>>> [32933.359168] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>>> [32933.365225] rpc_release_resources_task+0xe/0x50 [sunrpc]
>>> [32933.371381] __rpc_execute+0x2c5/0x4e0 [sunrpc]
>>> [32933.376564] ? __switch_to_asm+0x42/0x70
>>> [32933.381046] ? finish_task_switch+0xb2/0x2c0
>>> [32933.385918] rpc_async_schedule+0x29/0x40 [sunrpc]
>>> [32933.391391] process_one_work+0x1c8/0x390
>>> [32933.395975] worker_thread+0x30/0x360
>>> [32933.400162] ? process_one_work+0x390/0x390
>>> [32933.404931] kthread+0xd9/0x100
>>> [32933.408536] ? kthread_complete_and_exit+0x20/0x20
>>> [32933.413984] ret_from_fork+0x22/0x30
>>> [32933.418074] </TASK>
>>>
>>> The call trace shows up again at 245, 368, and 491 seconds. Same
>>> task, same trace.
>>>
>>>
>>>
>>>
>>
>> That's very helpful. The above trace suggests that the RDMA code is
>> leaking a call to xprt_unpin_rqst().
>
> IMHO this is unlikely to be related to the performance
> regression -- none of this code has changed in the past 5
> kernel releases. Could be a different issue, though.
>
> As is often the case in these situations, the INFO trace
> above happens long after the issue that caused the missing
> unpin. So... unless Dennis has a reproducer that can trigger
> the issue frequently, I don't think there's much that can
> be extracted from that.

To be fair, I've only seen this one time and have had the performance regression
since -rc1.

> Also "nfs: RPC call returned error 512" suggests someone
> hit ^C at some point. It's always possible that the
> xprt_rdma_free() path is missing an unpin. But again,
> that's not likely to be related to performance.

I've checked our test code and after 10 minutes it does give up trying to do the
NFS copies and aborts (SIG_INT) the test.

So in all my tests and bisect attempts it seems the possibility to hit a slow
NFS operation that hangs for minutes has been possible for quite some time.
However in 5.18 it gets much worse.

Any likely places I should add traces to try and find out what's stuck or taking
time?

-Denny

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: NFS regression between 5.17 and 5.18

Dennis, Chuck, I have below issue on the list of tracked regressions.
What's the status? Has any progress been made? Or is this not really a
regression and can be ignored?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

#regzbot poke
##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890

On 17.05.22 16:02, Chuck Lever III wrote:
>> On May 17, 2022, at 9:40 AM, Dennis Dalessandro <[email protected]> wrote:
>>
>> On 5/13/22 10:59 AM, Chuck Lever III wrote:
>>>>>
>>>>> Ran a test with -rc6 and this time see a hung task trace on the
>>>>> console as well
>>>>> as an NFS RPC error.
>>>>>
>>>>> [32719.991175] nfs: RPC call returned error 512
>>>>> .
>>>>> .
>>>>> .
>>>>> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more
>>>>> than 122 seconds.
>>>>> [32933.293543] Tainted: G S 5.18.0-rc6 #1
>>>>> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>>> disables this
>>>>> message.
>>>>> [32933.308740] task:kworker/u145:23 state:D stack: 0 pid:886141
>>>>> ppid: 2
>>>>> flags:0x00004000
>>>>> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
>>>>> [32933.324524] Call Trace:
>>>>> [32933.327347] <TASK>
>>>>> [32933.329785] __schedule+0x3dd/0x970
>>>>> [32933.333783] schedule+0x41/0xa0
>>>>> [32933.337388] xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
>>>>> [32933.343639] ? prepare_to_wait+0xd0/0xd0
>>>>> [32933.348123] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>>>>> [32933.354183] xprt_release+0x26/0x140 [sunrpc]
>>>>> [32933.359168] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>>>>> [32933.365225] rpc_release_resources_task+0xe/0x50 [sunrpc]
>>>>> [32933.371381] __rpc_execute+0x2c5/0x4e0 [sunrpc]
>>>>> [32933.376564] ? __switch_to_asm+0x42/0x70
>>>>> [32933.381046] ? finish_task_switch+0xb2/0x2c0
>>>>> [32933.385918] rpc_async_schedule+0x29/0x40 [sunrpc]
>>>>> [32933.391391] process_one_work+0x1c8/0x390
>>>>> [32933.395975] worker_thread+0x30/0x360
>>>>> [32933.400162] ? process_one_work+0x390/0x390
>>>>> [32933.404931] kthread+0xd9/0x100
>>>>> [32933.408536] ? kthread_complete_and_exit+0x20/0x20
>>>>> [32933.413984] ret_from_fork+0x22/0x30
>>>>> [32933.418074] </TASK>
>>>>>
>>>>> The call trace shows up again at 245, 368, and 491 seconds. Same
>>>>> task, same trace.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> That's very helpful. The above trace suggests that the RDMA code is
>>>> leaking a call to xprt_unpin_rqst().
>>>
>>> IMHO this is unlikely to be related to the performance
>>> regression -- none of this code has changed in the past 5
>>> kernel releases. Could be a different issue, though.
>>>
>>> As is often the case in these situations, the INFO trace
>>> above happens long after the issue that caused the missing
>>> unpin. So... unless Dennis has a reproducer that can trigger
>>> the issue frequently, I don't think there's much that can
>>> be extracted from that.
>>
>> To be fair, I've only seen this one time and have had the performance regression
>> since -rc1.
>>
>>> Also "nfs: RPC call returned error 512" suggests someone
>>> hit ^C at some point. It's always possible that the
>>> xprt_rdma_free() path is missing an unpin. But again,
>>> that's not likely to be related to performance.
>>
>> I've checked our test code and after 10 minutes it does give up trying to do the
>> NFS copies and aborts (SIG_INT) the test.
>
> After sleeping on it, I'm fairly certain the stack trace
> above is a result of a gap in how xprtrdma handles a
> signaled RPC.
>
> Signal handling in that code is pretty hard to test, so not
> surprising that there's a lingering bug or two. One idea I
> had was to add a fault injector in the RPC scheduler to
> throw signals at random. I think it can be done without
> perturbing the hot path. Maybe I'll post an RFC patch.
>
>
>> So in all my tests and bisect attempts it seems the possibility to hit a slow
>> NFS operation that hangs for minutes has been possible for quite some time.
>> However in 5.18 it gets much worse.
>>
>> Any likely places I should add traces to try and find out what's stuck or taking
>> time?
>
> There's been a lot of churn in that area in recent releases,
> so I'm not familiar with the existing tracepoints. Maybe
> Ben or Trond could provide some guidance.

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: NFS regression between 5.17 and 5.18

On 20.06.22 16:11, Chuck Lever III wrote:
>
>
>> On Jun 20, 2022, at 3:46 AM, Thorsten Leemhuis <[email protected]> wrote:
>>
>> Dennis, Chuck, I have below issue on the list of tracked regressions.
>> What's the status? Has any progress been made? Or is this not really a
>> regression and can be ignored?
>>
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>
>> P.S.: As the Linux kernel's regression tracker I deal with a lot of
>> reports and sometimes miss something important when writing mails like
>> this. If that's the case here, don't hesitate to tell me in a public
>> reply, it's in everyone's interest to set the public record straight.
>>
>> #regzbot poke
>> ##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890
>
> The above link points to an Apple trackpad bug.

Yeah, I know, sorry, should have mentioned: either I or my bot did
something stupid and associated that report with this regression, that's
why I deassociated it with the "unlink" command.

> The bug described all the way at the bottom was the origin problem
> report. I believe this is an NFS client issue. We are waiting for
> a response from the NFS client maintainers to help Dennis track
> this down.

Many thx for the status update. Can anything be done to speed things up?
This has taken quite a long time already -- way longer than outlined in
"Prioritize work on fixing regressions" here:
https://docs.kernel.org/process/handling-regressions.html

Ciao, Thorsten

>> On 17.05.22 16:02, Chuck Lever III wrote:
>>>> On May 17, 2022, at 9:40 AM, Dennis Dalessandro <[email protected]> wrote:
>>>>
>>>> On 5/13/22 10:59 AM, Chuck Lever III wrote:
>>>>>>>
>>>>>>> Ran a test with -rc6 and this time see a hung task trace on the
>>>>>>> console as well
>>>>>>> as an NFS RPC error.
>>>>>>>
>>>>>>> [32719.991175] nfs: RPC call returned error 512
>>>>>>> .
>>>>>>> .
>>>>>>> .
>>>>>>> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more
>>>>>>> than 122 seconds.
>>>>>>> [32933.293543] Tainted: G S 5.18.0-rc6 #1
>>>>>>> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>>>>> disables this
>>>>>>> message.
>>>>>>> [32933.308740] task:kworker/u145:23 state:D stack: 0 pid:886141
>>>>>>> ppid: 2
>>>>>>> flags:0x00004000
>>>>>>> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
>>>>>>> [32933.324524] Call Trace:
>>>>>>> [32933.327347] <TASK>
>>>>>>> [32933.329785] __schedule+0x3dd/0x970
>>>>>>> [32933.333783] schedule+0x41/0xa0
>>>>>>> [32933.337388] xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
>>>>>>> [32933.343639] ? prepare_to_wait+0xd0/0xd0
>>>>>>> [32933.348123] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>>>>>>> [32933.354183] xprt_release+0x26/0x140 [sunrpc]
>>>>>>> [32933.359168] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>>>>>>> [32933.365225] rpc_release_resources_task+0xe/0x50 [sunrpc]
>>>>>>> [32933.371381] __rpc_execute+0x2c5/0x4e0 [sunrpc]
>>>>>>> [32933.376564] ? __switch_to_asm+0x42/0x70
>>>>>>> [32933.381046] ? finish_task_switch+0xb2/0x2c0
>>>>>>> [32933.385918] rpc_async_schedule+0x29/0x40 [sunrpc]
>>>>>>> [32933.391391] process_one_work+0x1c8/0x390
>>>>>>> [32933.395975] worker_thread+0x30/0x360
>>>>>>> [32933.400162] ? process_one_work+0x390/0x390
>>>>>>> [32933.404931] kthread+0xd9/0x100
>>>>>>> [32933.408536] ? kthread_complete_and_exit+0x20/0x20
>>>>>>> [32933.413984] ret_from_fork+0x22/0x30
>>>>>>> [32933.418074] </TASK>
>>>>>>>
>>>>>>> The call trace shows up again at 245, 368, and 491 seconds. Same
>>>>>>> task, same trace.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> That's very helpful. The above trace suggests that the RDMA code is
>>>>>> leaking a call to xprt_unpin_rqst().
>>>>>
>>>>> IMHO this is unlikely to be related to the performance
>>>>> regression -- none of this code has changed in the past 5
>>>>> kernel releases. Could be a different issue, though.
>>>>>
>>>>> As is often the case in these situations, the INFO trace
>>>>> above happens long after the issue that caused the missing
>>>>> unpin. So... unless Dennis has a reproducer that can trigger
>>>>> the issue frequently, I don't think there's much that can
>>>>> be extracted from that.
>>>>
>>>> To be fair, I've only seen this one time and have had the performance regression
>>>> since -rc1.
>>>>
>>>>> Also "nfs: RPC call returned error 512" suggests someone
>>>>> hit ^C at some point. It's always possible that the
>>>>> xprt_rdma_free() path is missing an unpin. But again,
>>>>> that's not likely to be related to performance.
>>>>
>>>> I've checked our test code and after 10 minutes it does give up trying to do the
>>>> NFS copies and aborts (SIG_INT) the test.
>>>
>>> After sleeping on it, I'm fairly certain the stack trace
>>> above is a result of a gap in how xprtrdma handles a
>>> signaled RPC.
>>>
>>> Signal handling in that code is pretty hard to test, so not
>>> surprising that there's a lingering bug or two. One idea I
>>> had was to add a fault injector in the RPC scheduler to
>>> throw signals at random. I think it can be done without
>>> perturbing the hot path. Maybe I'll post an RFC patch.
>>>
>>>
>>>> So in all my tests and bisect attempts it seems the possibility to hit a slow
>>>> NFS operation that hangs for minutes has been possible for quite some time.
>>>> However in 5.18 it gets much worse.
>>>>
>>>> Any likely places I should add traces to try and find out what's stuck or taking
>>>> time?
>>>
>>> There's been a lot of churn in that area in recent releases,
>>> so I'm not familiar with the existing tracepoints. Maybe
>>> Ben or Trond could provide some guidance.
>>
>
> --
> Chuck Lever
>
>
>
>

2022-06-20 15:07:04

by Chuck Lever

[permalink] [raw]
Subject: Re: NFS regression between 5.17 and 5.18

Hi Thorsten-

> On Jun 20, 2022, at 10:29 AM, Thorsten Leemhuis <[email protected]> wrote:
>
> On 20.06.22 16:11, Chuck Lever III wrote:
>>
>>
>>> On Jun 20, 2022, at 3:46 AM, Thorsten Leemhuis <[email protected]> wrote:
>>>
>>> Dennis, Chuck, I have below issue on the list of tracked regressions.
>>> What's the status? Has any progress been made? Or is this not really a
>>> regression and can be ignored?
>>>
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>
>>> P.S.: As the Linux kernel's regression tracker I deal with a lot of
>>> reports and sometimes miss something important when writing mails like
>>> this. If that's the case here, don't hesitate to tell me in a public
>>> reply, it's in everyone's interest to set the public record straight.
>>>
>>> #regzbot poke
>>> ##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890
>>
>> The above link points to an Apple trackpad bug.
>
> Yeah, I know, sorry, should have mentioned: either I or my bot did
> something stupid and associated that report with this regression, that's
> why I deassociated it with the "unlink" command.

Is there an open bugzilla for the original regression?


>> The bug described all the way at the bottom was the origin problem
>> report. I believe this is an NFS client issue. We are waiting for
>> a response from the NFS client maintainers to help Dennis track
>> this down.
>
> Many thx for the status update. Can anything be done to speed things up?
> This has taken quite a long time already -- way longer than outlined in
> "Prioritize work on fixing regressions" here:
> https://docs.kernel.org/process/handling-regressions.html

ENOTMYMONKEYS ;-)

I was involved to help with the ^C issue that happened while
Dennis was troubleshooting. It's not related to the original
regression, which needs to be pursued by the NFS client
maintainers.

The correct people to poke are Trond, Olga (both cc'd) and
Anna Schumaker.


--
Chuck Lever



2022-06-20 15:58:04

by Chuck Lever

[permalink] [raw]
Subject: Re: NFS regression between 5.17 and 5.18



> On Jun 20, 2022, at 3:46 AM, Thorsten Leemhuis <[email protected]> wrote:
>
> Dennis, Chuck, I have below issue on the list of tracked regressions.
> What's the status? Has any progress been made? Or is this not really a
> regression and can be ignored?
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>
> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> reports and sometimes miss something important when writing mails like
> this. If that's the case here, don't hesitate to tell me in a public
> reply, it's in everyone's interest to set the public record straight.
>
> #regzbot poke
> ##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890

The above link points to an Apple trackpad bug.

The bug described all the way at the bottom was the origin problem
report. I believe this is an NFS client issue. We are waiting for
a response from the NFS client maintainers to help Dennis track
this down.


> On 17.05.22 16:02, Chuck Lever III wrote:
>>> On May 17, 2022, at 9:40 AM, Dennis Dalessandro <[email protected]> wrote:
>>>
>>> On 5/13/22 10:59 AM, Chuck Lever III wrote:
>>>>>>
>>>>>> Ran a test with -rc6 and this time see a hung task trace on the
>>>>>> console as well
>>>>>> as an NFS RPC error.
>>>>>>
>>>>>> [32719.991175] nfs: RPC call returned error 512
>>>>>> .
>>>>>> .
>>>>>> .
>>>>>> [32933.285126] INFO: task kworker/u145:23:886141 blocked for more
>>>>>> than 122 seconds.
>>>>>> [32933.293543] Tainted: G S 5.18.0-rc6 #1
>>>>>> [32933.299869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>>>> disables this
>>>>>> message.
>>>>>> [32933.308740] task:kworker/u145:23 state:D stack: 0 pid:886141
>>>>>> ppid: 2
>>>>>> flags:0x00004000
>>>>>> [32933.318321] Workqueue: rpciod rpc_async_schedule [sunrpc]
>>>>>> [32933.324524] Call Trace:
>>>>>> [32933.327347] <TASK>
>>>>>> [32933.329785] __schedule+0x3dd/0x970
>>>>>> [32933.333783] schedule+0x41/0xa0
>>>>>> [32933.337388] xprt_request_dequeue_xprt+0xd1/0x140 [sunrpc]
>>>>>> [32933.343639] ? prepare_to_wait+0xd0/0xd0
>>>>>> [32933.348123] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>>>>>> [32933.354183] xprt_release+0x26/0x140 [sunrpc]
>>>>>> [32933.359168] ? rpc_destroy_wait_queue+0x10/0x10 [sunrpc]
>>>>>> [32933.365225] rpc_release_resources_task+0xe/0x50 [sunrpc]
>>>>>> [32933.371381] __rpc_execute+0x2c5/0x4e0 [sunrpc]
>>>>>> [32933.376564] ? __switch_to_asm+0x42/0x70
>>>>>> [32933.381046] ? finish_task_switch+0xb2/0x2c0
>>>>>> [32933.385918] rpc_async_schedule+0x29/0x40 [sunrpc]
>>>>>> [32933.391391] process_one_work+0x1c8/0x390
>>>>>> [32933.395975] worker_thread+0x30/0x360
>>>>>> [32933.400162] ? process_one_work+0x390/0x390
>>>>>> [32933.404931] kthread+0xd9/0x100
>>>>>> [32933.408536] ? kthread_complete_and_exit+0x20/0x20
>>>>>> [32933.413984] ret_from_fork+0x22/0x30
>>>>>> [32933.418074] </TASK>
>>>>>>
>>>>>> The call trace shows up again at 245, 368, and 491 seconds. Same
>>>>>> task, same trace.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> That's very helpful. The above trace suggests that the RDMA code is
>>>>> leaking a call to xprt_unpin_rqst().
>>>>
>>>> IMHO this is unlikely to be related to the performance
>>>> regression -- none of this code has changed in the past 5
>>>> kernel releases. Could be a different issue, though.
>>>>
>>>> As is often the case in these situations, the INFO trace
>>>> above happens long after the issue that caused the missing
>>>> unpin. So... unless Dennis has a reproducer that can trigger
>>>> the issue frequently, I don't think there's much that can
>>>> be extracted from that.
>>>
>>> To be fair, I've only seen this one time and have had the performance regression
>>> since -rc1.
>>>
>>>> Also "nfs: RPC call returned error 512" suggests someone
>>>> hit ^C at some point. It's always possible that the
>>>> xprt_rdma_free() path is missing an unpin. But again,
>>>> that's not likely to be related to performance.
>>>
>>> I've checked our test code and after 10 minutes it does give up trying to do the
>>> NFS copies and aborts (SIG_INT) the test.
>>
>> After sleeping on it, I'm fairly certain the stack trace
>> above is a result of a gap in how xprtrdma handles a
>> signaled RPC.
>>
>> Signal handling in that code is pretty hard to test, so not
>> surprising that there's a lingering bug or two. One idea I
>> had was to add a fault injector in the RPC scheduler to
>> throw signals at random. I think it can be done without
>> perturbing the hot path. Maybe I'll post an RFC patch.
>>
>>
>>> So in all my tests and bisect attempts it seems the possibility to hit a slow
>>> NFS operation that hangs for minutes has been possible for quite some time.
>>> However in 5.18 it gets much worse.
>>>
>>> Any likely places I should add traces to try and find out what's stuck or taking
>>> time?
>>
>> There's been a lot of churn in that area in recent releases,
>> so I'm not familiar with the existing tracepoints. Maybe
>> Ben or Trond could provide some guidance.
>

--
Chuck Lever



2022-06-20 17:17:03

by Dennis Dalessandro

[permalink] [raw]
Subject: Re: NFS regression between 5.17 and 5.18

On 6/20/22 10:40 AM, Chuck Lever III wrote:
> Hi Thorsten-
>
>> On Jun 20, 2022, at 10:29 AM, Thorsten Leemhuis <[email protected]> wrote:
>>
>> On 20.06.22 16:11, Chuck Lever III wrote:
>>>
>>>
>>>> On Jun 20, 2022, at 3:46 AM, Thorsten Leemhuis <[email protected]> wrote:
>>>>
>>>> Dennis, Chuck, I have below issue on the list of tracked regressions.
>>>> What's the status? Has any progress been made? Or is this not really a
>>>> regression and can be ignored?
>>>>
>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>>
>>>> P.S.: As the Linux kernel's regression tracker I deal with a lot of
>>>> reports and sometimes miss something important when writing mails like
>>>> this. If that's the case here, don't hesitate to tell me in a public
>>>> reply, it's in everyone's interest to set the public record straight.
>>>>
>>>> #regzbot poke
>>>> ##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890
>>>
>>> The above link points to an Apple trackpad bug.
>>
>> Yeah, I know, sorry, should have mentioned: either I or my bot did
>> something stupid and associated that report with this regression, that's
>> why I deassociated it with the "unlink" command.
>
> Is there an open bugzilla for the original regression?
>
>
>>> The bug described all the way at the bottom was the origin problem
>>> report. I believe this is an NFS client issue. We are waiting for
>>> a response from the NFS client maintainers to help Dennis track
>>> this down.
>>
>> Many thx for the status update. Can anything be done to speed things up?
>> This has taken quite a long time already -- way longer than outlined in
>> "Prioritize work on fixing regressions" here:
>> https://docs.kernel.org/process/handling-regressions.html
>
> ENOTMYMONKEYS ;-)
>
> I was involved to help with the ^C issue that happened while
> Dennis was troubleshooting. It's not related to the original
> regression, which needs to be pursued by the NFS client
> maintainers.
>
> The correct people to poke are Trond, Olga (both cc'd) and
> Anna Schumaker.

Perhaps I should open a bugzilla for the regression. The Ctrl+C issue was a
result of the test we were running taking too long. It times out after 10
minutes or so and kills the process. So a downstream effect of the regression.

The test is still continuing to fail as of 5.19-rc2. I'll double check that it's
the same issue and open a bugzilla.

Thanks for poking at this.

-Denny

2022-06-21 16:06:16

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: NFS regression between 5.17 and 5.18

Hi Dennis,

Can I ask some basic questions? Have you tried to get any kinds of
profiling done to see where the client is spending time (using perf
perhaps)?

real 4m11.835s
user 0m0.001s
sys 0m0.277s

sounds like about 4 minutes are spent sleeping somewhere? Did it take 4
minutes to do a network transfer (if we had a network trace we could see
how long the network transfers were)? Do you have one (that goes along with
something that can tell us approximately when the request began from
the cp's perspective, like a date beforehand)?

I see that there were no rdma changes that went into the 5.18 kernel, so
whatever changed is either generic NFS behaviour or perhaps something
in the rdma core code (is a Mellanox card being used here?)

I wonder if the slowdown only happens on rdma or is it visible on the
tcp mount as well, have you tried?
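
Something along these lines would be enough for a comparison run -- a
sketch only, with a placeholder server address, export path and NFS
version that would need to match your setup:

  mkdir -p /mnt/nfs_test_tcp
  mount -t nfs -o vers=4.2,proto=tcp <server>:/mnt/nfs_test /mnt/nfs_test_tcp
  time cp /mnt/nfs_test_tcp/run_nfs_test.junk /dev/shm/run_nfs_test.tmp
  umount /mnt/nfs_test_tcp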



On Mon, Jun 20, 2022 at 1:06 PM Dennis Dalessandro
<[email protected]> wrote:
>
> On 6/20/22 10:40 AM, Chuck Lever III wrote:
> > Hi Thorsten-
> >
> >> On Jun 20, 2022, at 10:29 AM, Thorsten Leemhuis <[email protected]> wrote:
> >>
> >> On 20.06.22 16:11, Chuck Lever III wrote:
> >>>
> >>>
> >>>> On Jun 20, 2022, at 3:46 AM, Thorsten Leemhuis <[email protected]> wrote:
> >>>>
> >>>> Dennis, Chuck, I have below issue on the list of tracked regressions.
> >>>> What's the status? Has any progress been made? Or is this not really a
> >>>> regression and can be ignored?
> >>>>
> >>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> >>>>
> >>>> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> >>>> reports and sometimes miss something important when writing mails like
> >>>> this. If that's the case here, don't hesitate to tell me in a public
> >>>> reply, it's in everyone's interest to set the public record straight.
> >>>>
> >>>> #regzbot poke
> >>>> ##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890
> >>>
> >>> The above link points to an Apple trackpad bug.
> >>
> >> Yeah, I know, sorry, should have mentioned: either I or my bot did
> >> something stupid and associated that report with this regression, that's
> >> why I deassociated it with the "unlink" command.
> >
> > Is there an open bugzilla for the original regression?
> >
> >
> >>> The bug described all the way at the bottom was the origin problem
> >>> report. I believe this is an NFS client issue. We are waiting for
> >>> a response from the NFS client maintainers to help Dennis track
> >>> this down.
> >>
> >> Many thx for the status update. Can anything be done to speed things up?
> >> This has taken quite a long time already -- way longer than outlined in
> >> "Prioritize work on fixing regressions" here:
> >> https://docs.kernel.org/process/handling-regressions.html
> >
> > ENOTMYMONKEYS ;-)
> >
> > I was involved to help with the ^C issue that happened while
> > Dennis was troubleshooting. It's not related to the original
> > regression, which needs to be pursued by the NFS client
> > maintainers.
> >
> > The correct people to poke are Trond, Olga (both cc'd) and
> > Anna Schumaker.
>
> Perhaps I should open a bugzilla for the regression. The Ctrl+C issue was a
> result of the test we were running taking too long. It times out after 10
> minutes or so and kills the process. So a downstream effect of the regression.
>
> The test is still continuing to fail as of 5.19-rc2. I'll double check that it's
> the same issue and open a bugzilla.
>
> Thanks for poking at this.
>
> -Denny

2022-06-21 17:01:38

by Dennis Dalessandro

[permalink] [raw]
Subject: Re: NFS regression between 5.17 and 5.18

On 6/21/22 12:04 PM, Olga Kornievskaia wrote:
> Hi Dennis,
>
> Can I ask some basic questions? Have you tried to get any kinds of
> profiling done to see where the client is spending time (using perf
> perhaps)?
>
> real 4m11.835s
> user 0m0.001s
> sys 0m0.277s
>
> sounds like about 4 minutes are spent sleeping somewhere? Did it take 4
> minutes to do a network transfer (if we had a network trace we could see
> how long the network transfers were)? Do you have one (that goes along with
> something that can tell us approximately when the request began from
> the cp's perspective, like a date beforehand)?
>
> I see that there were no rdma changes that went into the 5.18 kernel, so
> whatever changed is either generic NFS behaviour or perhaps something
> in the rdma core code (is a Mellanox card being used here?)
>
> I wonder if the slowdown only happens on rdma or is it visible on the
> tcp mount as well, have you tried?
>

Hi Olga,

I have opened a Kernel Bugzilla if you would rather log future responses there:
https://bugzilla.kernel.org/show_bug.cgi?id=216160

To answer your questions: this is on Omni-Path hardware. I have not tried
the TCP mount, but I can. I don't have a network trace per se or a profile;
we don't have support for a TCP dump or anything like that. However, I can
tell you there is nothing going over the network while it appears to be
hung. I can monitor the packet counters.
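
Something like this is what I mean by watching the counters -- just a
sketch, and the device name is a placeholder (assuming the standard RDMA
sysfs counters are exposed for the HFI):

  DEV=hfi1_0    # see: ls /sys/class/infiniband/
  watch -n 1 "grep . /sys/class/infiniband/$DEV/ports/1/counters/port_*_data"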

If you have some ideas where I could put some trace points that could tell us
something I can certainly add those.

-Denny

>
>
> On Mon, Jun 20, 2022 at 1:06 PM Dennis Dalessandro
> <[email protected]> wrote:
>>
>> On 6/20/22 10:40 AM, Chuck Lever III wrote:
>>> Hi Thorsten-
>>>
>>>> On Jun 20, 2022, at 10:29 AM, Thorsten Leemhuis <[email protected]> wrote:
>>>>
>>>> On 20.06.22 16:11, Chuck Lever III wrote:
>>>>>
>>>>>
>>>>>> On Jun 20, 2022, at 3:46 AM, Thorsten Leemhuis <[email protected]> wrote:
>>>>>>
>>>>>> Dennis, Chuck, I have below issue on the list of tracked regressions.
>>>>>> What's the status? Has any progress been made? Or is this not really a
>>>>>> regression and can be ignored?
>>>>>>
>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>>>>
>>>>>> P.S.: As the Linux kernel's regression tracker I deal with a lot of
>>>>>> reports and sometimes miss something important when writing mails like
>>>>>> this. If that's the case here, don't hesitate to tell me in a public
>>>>>> reply, it's in everyone's interest to set the public record straight.
>>>>>>
>>>>>> #regzbot poke
>>>>>> ##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890
>>>>>
>>>>> The above link points to an Apple trackpad bug.
>>>>
>>>> Yeah, I know, sorry, should have mentioned: either I or my bot did
>>>> something stupid and associated that report with this regression, that's
>>>> why I deassociated it with the "unlink" command.
>>>
>>> Is there an open bugzilla for the original regression?
>>>
>>>
>>>>> The bug described all the way at the bottom was the origin problem
>>>>> report. I believe this is an NFS client issue. We are waiting for
>>>>> a response from the NFS client maintainers to help Dennis track
>>>>> this down.
>>>>
>>>> Many thx for the status update. Can anything be done to speed things up?
>>>> This has taken quite a long time already -- way longer than outlined in
>>>> "Prioritize work on fixing regressions" here:
>>>> https://docs.kernel.org/process/handling-regressions.html
>>>
>>> ENOTMYMONKEYS ;-)
>>>
>>> I was involved to help with the ^C issue that happened while
>>> Dennis was troubleshooting. It's not related to the original
>>> regression, which needs to be pursued by the NFS client
>>> maintainers.
>>>
>>> The correct people to poke are Trond, Olga (both cc'd) and
>>> Anna Schumaker.
>>
>> Perhaps I should open a bugzilla for the regression. The Ctrl+C issue was a
>> result of the test we were running taking too long. It times out after 10
>> minutes or so and kills the process. So a downstream effect of the regression.
>>
>> The test is still continuing to fail as of 5.19-rc2. I'll double check that it's
>> the same issue and open a bugzilla.
>>
>> Thanks for poking at this.
>>
>> -Denny

2022-06-21 18:01:08

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: NFS regression between 5.17 and 5.18

On Tue, Jun 21, 2022 at 12:58 PM Dennis Dalessandro
<[email protected]> wrote:
>
> On 6/21/22 12:04 PM, Olga Kornievskaia wrote:
> > Hi Dennis,
> >
> > Can I ask some basic questions? Have you tried to get any kinds of
> > profiling done to see where the client is spending time (using perf
> > perhaps)?
> >
> > real 4m11.835s
> > user 0m0.001s
> > sys 0m0.277s
> >
> > sounds like about 4 minutes are spent sleeping somewhere? Did it take 4
> > minutes to do a network transfer (if we had a network trace we could see
> > how long the network transfers were)? Do you have one (that goes along with
> > something that can tell us approximately when the request began from
> > the cp's perspective, like a date beforehand)?
> >
> > I see that there were no rdma changes that went into the 5.18 kernel, so
> > whatever changed is either generic NFS behaviour or perhaps something
> > in the rdma core code (is a Mellanox card being used here?)
> >
> > I wonder if the slowdown only happens on rdma or is it visible on the
> > tcp mount as well, have you tried?
> >
>
> Hi Olga,
>
> I have opened a Kernel Bugzilla if you would rather log future responses there:
> https://bugzilla.kernel.org/show_bug.cgi?id=216160
>
> To answer your questions: this is on Omni-Path hardware. I have not tried
> the TCP mount, but I can. I don't have a network trace per se or a profile;
> we don't have support for a TCP dump or anything like that. However, I can
> tell you there is nothing going over the network while it appears to be
> hung. I can monitor the packet counters.

In this thread there are two problems raised: (1) a performance regression
and (2) a single run that hit a hung client.

For #1, given that there were no rdma changes added to 5.18, it seems like
something in the generic NFS code is causing issues for you, so I recommend
first using a Linux profiler to get some information about the time spent
in kernel functions triggered by the cp command. If you can't run a
profiler (which I think you should be able to), then perhaps just enable
the nfs4 and rpcrdma tracepoints, which also carry timestamps; looking at
the differences can give some clue about where the time is being spent.
Four minutes is a significant chunk of time and should be visible somewhere
in those timestamps.
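
A sketch of what that profiling could look like -- the output paths are
arbitrary, and kernel debug symbols are needed for a readable report:

  perf record -a -g -o /tmp/nfs-copy.perf.data -- \
      cp /mnt/nfs_test/run_nfs_test.junk /dev/shm/run_nfs_test.tmp
  perf report -i /tmp/nfs-copy.perf.data

  # Since nearly all of the 4 minutes is off-CPU, scheduler events may say more:
  perf record -e sched:sched_switch -a -o /tmp/nfs-sched.perf.data -- \
      cp /mnt/nfs_test/run_nfs_test.junk /dev/shm/run_nfs_test.tmp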

For #2, I have personally run into that stack trace while investigating a
hang using soft iWarp as the rdma provider. It was an unpinned request, but
I think it was due to a soft iWarp failure that caused it not to deliver a
completion to the rdma layer, which led to the request never getting
unpinned. Thus I would recommend looking into failures in your rdma
provider for clues on that problem.

> If you have some ideas where I could put some trace points that could tell us
> something I can certainly add those.
>
> -Denny
>
> >
> >
> > On Mon, Jun 20, 2022 at 1:06 PM Dennis Dalessandro
> > <[email protected]> wrote:
> >>
> >> On 6/20/22 10:40 AM, Chuck Lever III wrote:
> >>> Hi Thorsten-
> >>>
> >>>> On Jun 20, 2022, at 10:29 AM, Thorsten Leemhuis <[email protected]> wrote:
> >>>>
> >>>> On 20.06.22 16:11, Chuck Lever III wrote:
> >>>>>
> >>>>>
> >>>>>> On Jun 20, 2022, at 3:46 AM, Thorsten Leemhuis <[email protected]> wrote:
> >>>>>>
> >>>>>> Dennis, Chuck, I have below issue on the list of tracked regressions.
> >>>>>> What's the status? Has any progress been made? Or is this not really a
> >>>>>> regression and can be ignored?
> >>>>>>
> >>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> >>>>>>
> >>>>>> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> >>>>>> reports and sometimes miss something important when writing mails like
> >>>>>> this. If that's the case here, don't hesitate to tell me in a public
> >>>>>> reply, it's in everyone's interest to set the public record straight.
> >>>>>>
> >>>>>> #regzbot poke
> >>>>>> ##regzbot unlink: https://bugzilla.kernel.org/show_bug.cgi?id=215890
> >>>>>
> >>>>> The above link points to an Apple trackpad bug.
> >>>>
> >>>> Yeah, I know, sorry, should have mentioned: either I or my bot did
> >>>> something stupid and associated that report with this regression, that's
> >>>> why I deassociated it with the "unlink" command.
> >>>
> >>> Is there an open bugzilla for the original regression?
> >>>
> >>>
> >>>>> The bug described all the way at the bottom was the origin problem
> >>>>> report. I believe this is an NFS client issue. We are waiting for
> >>>>> a response from the NFS client maintainers to help Dennis track
> >>>>> this down.
> >>>>
> >>>> Many thx for the status update. Can anything be done to speed things up?
> >>>> This has taken quite a long time already -- way longer than outlined in
> >>>> "Prioritize work on fixing regressions" here:
> >>>> https://docs.kernel.org/process/handling-regressions.html
> >>>
> >>> ENOTMYMONKEYS ;-)
> >>>
> >>> I was involved to help with the ^C issue that happened while
> >>> Dennis was troubleshooting. It's not related to the original
> >>> regression, which needs to be pursued by the NFS client
> >>> maintainers.
> >>>
> >>> The correct people to poke are Trond, Olga (both cc'd) and
> >>> Anna Schumaker.
> >>
> >> Perhaps I should open a bugzilla for the regression. The Ctrl+C issue was a
> >> result of the test we were running taking too long. It times out after 10
> >> minutes or so and kills the process. So a downstream effect of the regression.
> >>
> >> The test is still continuing to fail as of 5.19-rc2. I'll double check that it's
> >> the same issue and open a bugzilla.
> >>
> >> Thanks for poking at this.
> >>
> >> -Denny

2022-06-21 18:05:27

by Chuck Lever

[permalink] [raw]
Subject: Re: NFS regression between 5.17 and 5.18



> On Jun 21, 2022, at 1:51 PM, Olga Kornievskaia <[email protected]> wrote:
>
> For #2, I have personally run into that stack trace while investigating a
> hang using soft iWarp as the rdma provider. It was an unpinned request, but
> I think it was due to a soft iWarp failure that caused it not to deliver a
> completion to the rdma layer, which led to the request never getting
> unpinned. Thus I would recommend looking into failures in your rdma
> provider for clues on that problem.

#2 is a separate problem entirely, and is a known bug. I plan to
add some fault injection so we can reproduce and track it down.
I think you should prune this issue and focus just on #1 for now.

--
Chuck Lever