2012-08-17 01:57:03

by Michael Tokarev

Subject: Re: 3.0+ NFS issues (bisected)

On 12.07.2012 16:53, J. Bruce Fields wrote:
> On Tue, Jul 10, 2012 at 04:52:03PM +0400, Michael Tokarev wrote:
>> I tried to debug this again, maybe to reproduce in a virtual machine,
>> and found out that it is only 32bit server code shows this issue:
>> after updating the kernel on the server to 64bit (the same version)
>> I can't reproduce this issue anymore. Rebooting back to 32bit,
>> and voila, it is here again.
>>
>> Something apparenlty isn't right on 32bits... ;)
>>
>> (And yes, the prob is still present and is very annoying :)
>
> OK, that's very useful, thanks. So probably a bug got introduced in the
> 32-bit case between 2.6.32 and 3.0.
>
> My personal upstream testing is normally all x86_64 only. I'll kick off
> a 32-bit install and see if I can reproduce this quickly.

Actually it has nothing to do with 32 vs 64 bits as I
initially thought. It happens on 64bits too, but takes
more time (or data to transfer) to trigger.


> Let me know if you're able to narrow this down any more.

I bisected this issue to the following commit:

commit f03d78db65085609938fdb686238867e65003181
Author: Eric Dumazet <[email protected]>
Date: Thu Jul 7 00:27:05 2011 -0700

net: refine {udp|tcp|sctp}_mem limits

Current tcp/udp/sctp global memory limits are not taking into account
hugepages allocations, and allow 50% of ram to be used by buffers of a
single protocol [ not counting space used by sockets / inodes ...]

Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram
per protocol, and a minimum of 128 pages.
Heavy duty machines sysadmins probably need to tweak limits anyway.


Reverting this commit on top of 3.0 (or any later 3.x kernel) fixes
the behaviour here.

This machine has 4GB of memory. On 3.0, with this patch applied
(as it is part of 3.0), tcp_mem looks like this:

21228 28306 42456

with this patch reverted, tcp_mem shows:

81216 108288 162432

and with these values, it works fine.
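
For reference, here is a minimal userspace sketch of the arithmetic behind
these defaults, assuming the commit's scheme of taking 1/8 of the free
lowmem pages as the pressure threshold; the function and the page count
passed to it are illustrative, chosen to reproduce the triple above:

#include <stdio.h>

/* Stand-in for the kernel's nr_free_buffer_pages(): the caller passes the
 * number of free lowmem pages.  28306 * 8 is an assumed value that yields
 * the tcp_mem triple reported above for this 4GB/32-bit server. */
static void tcp_mem_defaults(unsigned long free_buffer_pages,
			     unsigned long mem[3])
{
	unsigned long limit = free_buffer_pages / 8;	/* 1/8 of kernel RAM */

	if (limit < 128)
		limit = 128;			/* floor of 128 pages */
	mem[0] = limit / 4 * 3;			/* low mark */
	mem[1] = limit;				/* pressure mark */
	mem[2] = mem[0] * 2;			/* hard limit */
}

int main(void)
{
	unsigned long mem[3];

	tcp_mem_defaults(28306UL * 8, mem);
	printf("tcp_mem: %lu %lu %lu\n", mem[0], mem[1], mem[2]);
	return 0;				/* prints: 21228 28306 42456 */
}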

So it looks like something else goes wrong there,
which leads to all nfsds fighting with each other
for something and eating 100% of the available CPU
instead of servicing clients.

For added fun, when changing tcp_mem from the "bad" values to the
"good" values at runtime (after booting into a kernel with that
patch applied), the problem is _not_ fixed.
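
For completeness, the thresholds can be changed at runtime by writing three
page counts to /proc/sys/net/ipv4/tcp_mem (the same thing
"sysctl -w net.ipv4.tcp_mem=..." does); a trivial sketch, to be run as root,
using the pre-commit values quoted above. As said, doing this after boot did
not clear the already-stuck state here:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/net/ipv4/tcp_mem", "w");

	if (!f) {
		perror("/proc/sys/net/ipv4/tcp_mem");
		return 1;
	}
	/* low, pressure and high thresholds, in pages */
	fprintf(f, "81216 108288 162432\n");
	return fclose(f) ? 1 : 0;
}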

Any further hints?

Thanks,

/mjt

>> On 31.05.2012 17:51, Michael Tokarev wrote:
>>> On 31.05.2012 17:46, Myklebust, Trond wrote:
>>>> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
>>> []
>>>>> I started tcpdump:
>>>>>
>>>>> tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
>>>>>
>>>>> on the client (192.168.88.2). Next I mounted a directory on the client,
>>>>> and started reading (tar'ing) a directory into /dev/null. It captured a
>>>>> few stalls. Tcpdump shows number of packets it got, the stalls are at
>>>>> packet counts 58090, 97069 and 97071. I cancelled the capture after that.
>>>>>
>>>>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
>>>>> it is 220Mb uncompressed and 1.3Mb compressed. The source files are
>>>>> 10 files of 1Gb each, all made by using `truncate' utility, so does not
>>>>> take place on disk at all. This also makes it obvious that the issue
>>>>> does not depend on the speed of disk on the server (since in this case,
>>>>> the server disk isn't even in use).
>>>>
>>>> OK. So from the above file it looks as if the traffic is mainly READ
>>>> requests.
>>>
>>> The issue here happens only with reads.
>>>
>>>> In 2 places the server stops responding. In both cases, the client seems
>>>> to be sending a single TCP frame containing several COMPOUNDS containing
>>>> READ requests (which should be legal) just prior to the hang. When the
>>>> server doesn't respond, the client pings it with a RENEW, before it ends
>>>> up severing the TCP connection and then retransmitting.
>>>
>>> And sometimes -- speaking only from the behavour I've seen, not from the
>>> actual frames sent -- server does not respond to the RENEW too, in which
>>> case the client reports "nfs server no responding", and on the next
>>> renew it may actually respond. This happens too, but much more rare.
>>>
>>> During these stalls, ie, when there's no network activity at all,
>>> the server NFSD threads are busy eating all available CPU.
>>>
>>> What does it all tell us? :)
>>>
>>> Thank you!
>>>
>>> /mjt


2012-08-17 14:56:39

by J. Bruce Fields

Subject: Re: 3.0+ NFS issues (bisected)

On Fri, Aug 17, 2012 at 05:56:56AM +0400, Michael Tokarev wrote:
> On 12.07.2012 16:53, J. Bruce Fields wrote:
> > On Tue, Jul 10, 2012 at 04:52:03PM +0400, Michael Tokarev wrote:
> >> I tried to debug this again, maybe to reproduce in a virtual machine,
> >> and found out that it is only 32bit server code shows this issue:
> >> after updating the kernel on the server to 64bit (the same version)
> >> I can't reproduce this issue anymore. Rebooting back to 32bit,
> >> and voila, it is here again.
> >>
> >> Something apparenlty isn't right on 32bits... ;)
> >>
> >> (And yes, the prob is still present and is very annoying :)
> >
> > OK, that's very useful, thanks. So probably a bug got introduced in the
> > 32-bit case between 2.6.32 and 3.0.
> >
> > My personal upstream testing is normally all x86_64 only. I'll kick off
> > a 32-bit install and see if I can reproduce this quickly.
>
> Actually it has nothing to do with 32 vs 64 bits as I
> initially thought. It happens on 64bits too, but takes
> more time (or data to transfer) to trigger.

That makes it sound like some kind of leak: you're hitting this case
eventually either way, but it takes longer in the case where you have
more (low) memory.

I wish I was more familiar with the tcp code.... What number exactly is
being compared against those limits, and how could we watch it from
userspace?

--b.

>
>
> > Let me know if you're able to narrow this down any more.
>
> I bisected this issue to the following commit:
>
> commit f03d78db65085609938fdb686238867e65003181
> Author: Eric Dumazet <[email protected]>
> Date: Thu Jul 7 00:27:05 2011 -0700
>
> net: refine {udp|tcp|sctp}_mem limits
>
> Current tcp/udp/sctp global memory limits are not taking into account
> hugepages allocations, and allow 50% of ram to be used by buffers of a
> single protocol [ not counting space used by sockets / inodes ...]
>
> Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram
> per protocol, and a minimum of 128 pages.
> Heavy duty machines sysadmins probably need to tweak limits anyway.
>
>
> Reverting this commit on top of 3.0 (or any later 3.x kernel) fixes
> the behavour here.
>
> This machine has 4Gb of memory. On 3.0, with this patch applied
> (as it is part of 3.0), tcp_mem is like this:
>
> 21228 28306 42456
>
> with this patch reverted, tcp_mem shows:
>
> 81216 108288 162432
>
> and with these values, it works fine.
>
> So it looks like something else goes wrong there,
> which lead to all nfsds fighting with each other
> for something and eating 100% of available CPU
> instead of servicing clients.
>
> For added fun, when setting tcp_mem to the "good" value
> from "bad" value (after booting into kernel with that
> patch applied), the problem is _not_ fixed.
>
> Any further hints?
>
> Thanks,
>
> /mjt
>
> >> On 31.05.2012 17:51, Michael Tokarev wrote:
> >>> On 31.05.2012 17:46, Myklebust, Trond wrote:
> >>>> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
> >>> []
> >>>>> I started tcpdump:
> >>>>>
> >>>>> tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
> >>>>>
> >>>>> on the client (192.168.88.2). Next I mounted a directory on the client,
> >>>>> and started reading (tar'ing) a directory into /dev/null. It captured a
> >>>>> few stalls. Tcpdump shows number of packets it got, the stalls are at
> >>>>> packet counts 58090, 97069 and 97071. I cancelled the capture after that.
> >>>>>
> >>>>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
> >>>>> it is 220Mb uncompressed and 1.3Mb compressed. The source files are
> >>>>> 10 files of 1Gb each, all made by using `truncate' utility, so does not
> >>>>> take place on disk at all. This also makes it obvious that the issue
> >>>>> does not depend on the speed of disk on the server (since in this case,
> >>>>> the server disk isn't even in use).
> >>>>
> >>>> OK. So from the above file it looks as if the traffic is mainly READ
> >>>> requests.
> >>>
> >>> The issue here happens only with reads.
> >>>
> >>>> In 2 places the server stops responding. In both cases, the client seems
> >>>> to be sending a single TCP frame containing several COMPOUNDS containing
> >>>> READ requests (which should be legal) just prior to the hang. When the
> >>>> server doesn't respond, the client pings it with a RENEW, before it ends
> >>>> up severing the TCP connection and then retransmitting.
> >>>
> >>> And sometimes -- speaking only from the behavour I've seen, not from the
> >>> actual frames sent -- server does not respond to the RENEW too, in which
> >>> case the client reports "nfs server no responding", and on the next
> >>> renew it may actually respond. This happens too, but much more rare.
> >>>
> >>> During these stalls, ie, when there's no network activity at all,
> >>> the server NFSD threads are busy eating all available CPU.
> >>>
> >>> What does it all tell us? :)
> >>>
> >>> Thank you!
> >>>
> >>> /mjt
>

2012-08-17 16:01:14

by J. Bruce Fields

Subject: Re: 3.0+ NFS issues (bisected)

On Fri, Aug 17, 2012 at 10:56:16AM -0400, J. Bruce Fields wrote:
> On Fri, Aug 17, 2012 at 05:56:56AM +0400, Michael Tokarev wrote:
> > On 12.07.2012 16:53, J. Bruce Fields wrote:
> > > On Tue, Jul 10, 2012 at 04:52:03PM +0400, Michael Tokarev wrote:
> > >> I tried to debug this again, maybe to reproduce in a virtual machine,
> > >> and found out that it is only 32bit server code shows this issue:
> > >> after updating the kernel on the server to 64bit (the same version)
> > >> I can't reproduce this issue anymore. Rebooting back to 32bit,
> > >> and voila, it is here again.
> > >>
> > >> Something apparenlty isn't right on 32bits... ;)
> > >>
> > >> (And yes, the prob is still present and is very annoying :)
> > >
> > > OK, that's very useful, thanks. So probably a bug got introduced in the
> > > 32-bit case between 2.6.32 and 3.0.
> > >
> > > My personal upstream testing is normally all x86_64 only. I'll kick off
> > > a 32-bit install and see if I can reproduce this quickly.
> >
> > Actually it has nothing to do with 32 vs 64 bits as I
> > initially thought. It happens on 64bits too, but takes
> > more time (or data to transfer) to trigger.
>
> That makes it sound like some kind of leak: you're hitting this case
> eventually either way, but it takes longer in the case where you have
> more (low) memory.
>
> I wish I was more familiar with the tcp code.... What number exactly is
> being compared against those limits, and how could we watch it from
> userspace?

Uh, if I grepped my way through this right: it looks like it's the
"memory" column of the "TCP" row of /proc/net/protocols; might be
interesting to see how that's changing over time.
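
A small sketch of watching that from userspace, assuming the usual
"protocol size sockets memory ..." column layout of /proc/net/protocols:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Poll /proc/net/protocols once a second and print the "memory" column of
 * the TCP row (pages currently charged to TCP sockets). */
int main(void)
{
	char line[512];

	for (;;) {
		FILE *f = fopen("/proc/net/protocols", "r");

		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f)) {
			char name[32];
			long size, sockets, memory;

			if (sscanf(line, "%31s %ld %ld %ld",
				   name, &size, &sockets, &memory) == 4 &&
			    strcmp(name, "TCP") == 0)
				printf("TCP memory: %ld pages\n", memory);
		}
		fclose(f);
		sleep(1);
	}
}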

>
> --b.
>
> >
> >
> > > Let me know if you're able to narrow this down any more.
> >
> > I bisected this issue to the following commit:
> >
> > commit f03d78db65085609938fdb686238867e65003181
> > Author: Eric Dumazet <[email protected]>
> > Date: Thu Jul 7 00:27:05 2011 -0700
> >
> > net: refine {udp|tcp|sctp}_mem limits
> >
> > Current tcp/udp/sctp global memory limits are not taking into account
> > hugepages allocations, and allow 50% of ram to be used by buffers of a
> > single protocol [ not counting space used by sockets / inodes ...]
> >
> > Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram
> > per protocol, and a minimum of 128 pages.
> > Heavy duty machines sysadmins probably need to tweak limits anyway.
> >
> >
> > Reverting this commit on top of 3.0 (or any later 3.x kernel) fixes
> > the behavour here.
> >
> > This machine has 4Gb of memory. On 3.0, with this patch applied
> > (as it is part of 3.0), tcp_mem is like this:
> >
> > 21228 28306 42456
> >
> > with this patch reverted, tcp_mem shows:
> >
> > 81216 108288 162432
> >
> > and with these values, it works fine.
> >
> > So it looks like something else goes wrong there,
> > which lead to all nfsds fighting with each other
> > for something and eating 100% of available CPU
> > instead of servicing clients.
> >
> > For added fun, when setting tcp_mem to the "good" value
> > from "bad" value (after booting into kernel with that
> > patch applied), the problem is _not_ fixed.
> >
> > Any further hints?
> >
> > Thanks,
> >
> > /mjt
> >
> > >> On 31.05.2012 17:51, Michael Tokarev wrote:
> > >>> On 31.05.2012 17:46, Myklebust, Trond wrote:
> > >>>> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
> > >>> []
> > >>>>> I started tcpdump:
> > >>>>>
> > >>>>> tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
> > >>>>>
> > >>>>> on the client (192.168.88.2). Next I mounted a directory on the client,
> > >>>>> and started reading (tar'ing) a directory into /dev/null. It captured a
> > >>>>> few stalls. Tcpdump shows number of packets it got, the stalls are at
> > >>>>> packet counts 58090, 97069 and 97071. I cancelled the capture after that.
> > >>>>>
> > >>>>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
> > >>>>> it is 220Mb uncompressed and 1.3Mb compressed. The source files are
> > >>>>> 10 files of 1Gb each, all made by using `truncate' utility, so does not
> > >>>>> take place on disk at all. This also makes it obvious that the issue
> > >>>>> does not depend on the speed of disk on the server (since in this case,
> > >>>>> the server disk isn't even in use).
> > >>>>
> > >>>> OK. So from the above file it looks as if the traffic is mainly READ
> > >>>> requests.
> > >>>
> > >>> The issue here happens only with reads.
> > >>>
> > >>>> In 2 places the server stops responding. In both cases, the client seems
> > >>>> to be sending a single TCP frame containing several COMPOUNDS containing
> > >>>> READ requests (which should be legal) just prior to the hang. When the
> > >>>> server doesn't respond, the client pings it with a RENEW, before it ends
> > >>>> up severing the TCP connection and then retransmitting.
> > >>>
> > >>> And sometimes -- speaking only from the behavour I've seen, not from the
> > >>> actual frames sent -- server does not respond to the RENEW too, in which
> > >>> case the client reports "nfs server no responding", and on the next
> > >>> renew it may actually respond. This happens too, but much more rare.
> > >>>
> > >>> During these stalls, ie, when there's no network activity at all,
> > >>> the server NFSD threads are busy eating all available CPU.
> > >>>
> > >>> What does it all tell us? :)
> > >>>
> > >>> Thank you!
> > >>>
> > >>> /mjt
> >

2012-08-17 17:12:47

by Michael Tokarev

Subject: Re: 3.0+ NFS issues (bisected)

On 17.08.2012 20:00, J. Bruce Fields wrote:
[]
> Uh, if I grepped my way through this right: it looks like it's the
> "memory" column of the "TCP" row of /proc/net/protocols; might be
> interesting to see how that's changing over time.

This file does not look interesting. Memory usage does not jump,
and there is no large increase either.

But there is something else interesting here.

I noticed that in perf top, the top consumer of CPU is svc_recv()
(I mentioned this at the start of this thread). So I looked at how
this routine is called from nfsd. And here we go.

fs/nfsd/nfssvc.c:

/*
 * This is the NFS server kernel thread
 */
static int
nfsd(void *vrqstp)
{
...
	/*
	 * The main request loop
	 */
	for (;;) {
		/*
		 * Find a socket with data available and call its
		 * recvfrom routine.
		 */
		int i = 0;
		while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
			++i;
		printk(KERN_ERR "calling svc_recv: %d times (err=%d)\n", i, err);
		if (err == -EINTR)
			break;
...

(I added the "i" counter and the printk). And here's the output:

[19626.401136] calling svc_recv: 0 times (err=212)
[19626.405059] calling svc_recv: 1478 times (err=212)
[19626.409512] calling svc_recv: 1106 times (err=212)
[19626.543020] calling svc_recv: 0 times (err=212)
[19626.543059] calling svc_recv: 0 times (err=212)
[19626.548074] calling svc_recv: 0 times (err=212)
[19626.549515] calling svc_recv: 0 times (err=212)
[19626.552320] calling svc_recv: 0 times (err=212)
[19626.553503] calling svc_recv: 0 times (err=212)
[19626.556007] calling svc_recv: 0 times (err=212)
[19626.557152] calling svc_recv: 0 times (err=212)
[19626.560109] calling svc_recv: 0 times (err=212)
[19626.560943] calling svc_recv: 0 times (err=212)
[19626.565315] calling svc_recv: 1067 times (err=212)
[19626.569735] calling svc_recv: 2571 times (err=212)
[19626.574150] calling svc_recv: 3842 times (err=212)
[19626.581914] calling svc_recv: 2891 times (err=212)
[19626.583072] calling svc_recv: 1247 times (err=212)
[19626.616885] calling svc_recv: 0 times (err=212)
[19626.616952] calling svc_recv: 0 times (err=212)
[19626.622889] calling svc_recv: 0 times (err=212)
[19626.624518] calling svc_recv: 0 times (err=212)
[19626.627118] calling svc_recv: 0 times (err=212)
[19626.629735] calling svc_recv: 0 times (err=212)
[19626.631777] calling svc_recv: 0 times (err=212)
[19626.633986] calling svc_recv: 0 times (err=212)
[19626.636746] calling svc_recv: 0 times (err=212)
[19626.637692] calling svc_recv: 0 times (err=212)
[19626.640769] calling svc_recv: 0 times (err=212)
[19626.657852] calling svc_recv: 0 times (err=212)
[19626.661602] calling svc_recv: 0 times (err=212)
[19626.670160] calling svc_recv: 0 times (err=212)
[19626.671917] calling svc_recv: 0 times (err=212)
[19626.684643] calling svc_recv: 0 times (err=212)
[19626.684680] calling svc_recv: 0 times (err=212)
[19626.812820] calling svc_recv: 0 times (err=212)
[19626.814697] calling svc_recv: 0 times (err=212)
[19626.817195] calling svc_recv: 0 times (err=212)
[19626.820324] calling svc_recv: 0 times (err=212)
[19626.822855] calling svc_recv: 0 times (err=212)
[19626.824823] calling svc_recv: 0 times (err=212)
[19626.828016] calling svc_recv: 0 times (err=212)
[19626.829021] calling svc_recv: 0 times (err=212)
[19626.831970] calling svc_recv: 0 times (err=212)

> the stall begins:
[19686.823135] calling svc_recv: 3670352 times (err=212)
[19686.823524] calling svc_recv: 3659205 times (err=212)

> transfer continues
[19686.854734] calling svc_recv: 0 times (err=212)
[19686.860023] calling svc_recv: 0 times (err=212)
[19686.887124] calling svc_recv: 0 times (err=212)
[19686.895532] calling svc_recv: 0 times (err=212)
[19686.903667] calling svc_recv: 0 times (err=212)
[19686.922780] calling svc_recv: 0 times (err=212)

So we're calling svc_recv in a tight loop, eating
all available CPU. (The above is with just 2 nfsd
threads).

Something is definitely wrong here. And it happens much more
often after the mentioned commit (f03d78db65085).

Thanks,

/mjt

2012-08-17 17:19:07

by J. Bruce Fields

Subject: Re: 3.0+ NFS issues (bisected)

On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote:
> On 17.08.2012 20:00, J. Bruce Fields wrote:
> []> Uh, if I grepped my way through this right: it looks like it's the
> > "memory" column of the "TCP" row of /proc/net/protocols; might be
> > interesting to see how that's changing over time.
>
> This file does not look interesting. Memory usage does not jump,
> there's no high increase either.
>
> But there's something else which is interesting here.
>
> I noticed that in perf top, the top consumer of CPU is svc_recv()
> (I mentioned this in the start of this thread). So I looked how
> this routine is called from nfsd. And here we go.
>
> fs/nfsd/nfssvc.c:
>
> /*
> * This is the NFS server kernel thread
> */
> static int
> nfsd(void *vrqstp)
> {
> ...
> /*
> * The main request loop
> */
> for (;;) {
> /*
> * Find a socket with data available and call its
> * recvfrom routine.
> */
> int i = 0;
> while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
> ++i;
> printk(KERN_ERR "calling svc_recv: %d times (err=%d)\n", i, err);
> if (err == -EINTR)
> break;
> ...
>
> (I added the "i" counter and the printk). And here's the output:
>
> [19626.401136] calling svc_recv: 0 times (err=212)
> [19626.405059] calling svc_recv: 1478 times (err=212)
> [19626.409512] calling svc_recv: 1106 times (err=212)
> [19626.543020] calling svc_recv: 0 times (err=212)
> [19626.543059] calling svc_recv: 0 times (err=212)
> [19626.548074] calling svc_recv: 0 times (err=212)
> [19626.549515] calling svc_recv: 0 times (err=212)
> [19626.552320] calling svc_recv: 0 times (err=212)
> [19626.553503] calling svc_recv: 0 times (err=212)
> [19626.556007] calling svc_recv: 0 times (err=212)
> [19626.557152] calling svc_recv: 0 times (err=212)
> [19626.560109] calling svc_recv: 0 times (err=212)
> [19626.560943] calling svc_recv: 0 times (err=212)
> [19626.565315] calling svc_recv: 1067 times (err=212)
> [19626.569735] calling svc_recv: 2571 times (err=212)
> [19626.574150] calling svc_recv: 3842 times (err=212)
> [19626.581914] calling svc_recv: 2891 times (err=212)
> [19626.583072] calling svc_recv: 1247 times (err=212)
> [19626.616885] calling svc_recv: 0 times (err=212)
> [19626.616952] calling svc_recv: 0 times (err=212)
> [19626.622889] calling svc_recv: 0 times (err=212)
> [19626.624518] calling svc_recv: 0 times (err=212)
> [19626.627118] calling svc_recv: 0 times (err=212)
> [19626.629735] calling svc_recv: 0 times (err=212)
> [19626.631777] calling svc_recv: 0 times (err=212)
> [19626.633986] calling svc_recv: 0 times (err=212)
> [19626.636746] calling svc_recv: 0 times (err=212)
> [19626.637692] calling svc_recv: 0 times (err=212)
> [19626.640769] calling svc_recv: 0 times (err=212)
> [19626.657852] calling svc_recv: 0 times (err=212)
> [19626.661602] calling svc_recv: 0 times (err=212)
> [19626.670160] calling svc_recv: 0 times (err=212)
> [19626.671917] calling svc_recv: 0 times (err=212)
> [19626.684643] calling svc_recv: 0 times (err=212)
> [19626.684680] calling svc_recv: 0 times (err=212)
> [19626.812820] calling svc_recv: 0 times (err=212)
> [19626.814697] calling svc_recv: 0 times (err=212)
> [19626.817195] calling svc_recv: 0 times (err=212)
> [19626.820324] calling svc_recv: 0 times (err=212)
> [19626.822855] calling svc_recv: 0 times (err=212)
> [19626.824823] calling svc_recv: 0 times (err=212)
> [19626.828016] calling svc_recv: 0 times (err=212)
> [19626.829021] calling svc_recv: 0 times (err=212)
> [19626.831970] calling svc_recv: 0 times (err=212)
>
> > the stall begin:
> [19686.823135] calling svc_recv: 3670352 times (err=212)
> [19686.823524] calling svc_recv: 3659205 times (err=212)
>
> > transfer continues
> [19686.854734] calling svc_recv: 0 times (err=212)
> [19686.860023] calling svc_recv: 0 times (err=212)
> [19686.887124] calling svc_recv: 0 times (err=212)
> [19686.895532] calling svc_recv: 0 times (err=212)
> [19686.903667] calling svc_recv: 0 times (err=212)
> [19686.922780] calling svc_recv: 0 times (err=212)
>
> So we're calling svc_recv in a tight loop, eating
> all available CPU. (The above is with just 2 nfsd
> threads).
>
> Something is definitely wrong here. And it happens mure more
> often after the mentioned commit (f03d78db65085).

Oh, neat. Hm. That commit doesn't really sound like the cause, then.
Is that busy-looping reproducible on kernels before that commit?

--b.

2012-08-17 17:26:36

by Michael Tokarev

Subject: Re: 3.0+ NFS issues (bisected)

On 17.08.2012 21:18, J. Bruce Fields wrote:
> On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote:
[]
>> So we're calling svc_recv in a tight loop, eating
>> all available CPU. (The above is with just 2 nfsd
>> threads).
>>
>> Something is definitely wrong here. And it happens mure more
>> often after the mentioned commit (f03d78db65085).
>
> Oh, neat. Hm. That commit doesn't really sound like the cause, then.
> Is that busy-looping reproduceable on kernels before that commit?

Note I bisected this issue to this commit. I haven't seen it
happening before this commit, and reverting it from a 3.0 or 3.2
kernel makes the problem go away.

I guess it is looping there:


net/sunrpc/svc_xprt.c:svc_recv()
...
	len = 0;
...
	if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
		...
	} else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {	<=== here -- has no wspace due to memory...
		... len = <something>
	}

	/* No data, incomplete (TCP) read, or accept() */
	if (len == 0 || len == -EAGAIN)
		goto out;
...
out:
	rqstp->rq_res.len = 0;
	svc_xprt_release(rqstp);
	return -EAGAIN;
}
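
For TCP transports the xpo_has_wspace test roughly boils down to comparing
the socket's free send-buffer space against everything already reserved for
pending replies plus one more maximum-sized reply. A standalone toy model of
that comparison, with invented field names and numbers:

#include <stdio.h>

/* Toy model: the transport has "write space" only if the send buffer can
 * hold all replies already promised (xpt_reserved) plus one more
 * worst-case reply (sv_max_mesg).  All values below are invented. */
struct xprt_model {
	long stream_wspace;	/* free space in the socket send buffer,
				   presumably squeezed under TCP memory pressure */
	long xpt_reserved;	/* space reserved for queued replies */
	long sv_max_mesg;	/* upper bound on a single reply */
};

static int has_wspace(const struct xprt_model *x)
{
	return x->stream_wspace >= x->xpt_reserved + x->sv_max_mesg;
}

int main(void)
{
	struct xprt_model x = {
		.stream_wspace	= 256 * 1024,
		.xpt_reserved	= 1024 * 1024,
		.sv_max_mesg	= 1024 * 1024,
	};

	printf("has_wspace: %s\n", has_wspace(&x) ? "yes" : "no");
	return 0;
}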

I'm trying to verify this theory...

/mjt

2012-08-17 17:29:51

by Michael Tokarev

Subject: Re: 3.0+ NFS issues (bisected)

On 17.08.2012 21:26, Michael Tokarev wrote:
> On 17.08.2012 21:18, J. Bruce Fields wrote:
>> On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote:
> []
>>> So we're calling svc_recv in a tight loop, eating
>>> all available CPU. (The above is with just 2 nfsd
>>> threads).
>>>
>>> Something is definitely wrong here. And it happens mure more
>>> often after the mentioned commit (f03d78db65085).
>>
>> Oh, neat. Hm. That commit doesn't really sound like the cause, then.
>> Is that busy-looping reproduceable on kernels before that commit?
>
> Note I bisected this issue to this commit. I haven't seen it
> happening before this commit, and reverting it from 3.0 or 3.2
> kernel makes the problem to go away.
>
> I guess it is looping there:
>
>
> net/sunrpc/svc_xprt.c:svc_recv()
> ...
> len = 0;
> ...
> if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
> ...
> } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) { <=== here -- has no wspace due to memory...
> ... len = <something>
> }
>
> /* No data, incomplete (TCP) read, or accept() */
> if (len == 0 || len == -EAGAIN)
> goto out;
> ...
> out:
> rqstp->rq_res.len = 0;
> svc_xprt_release(rqstp);
> return -EAGAIN;
> }
>
> I'm trying to verify this theory...

Yes. I inserted a printk there, and during all those millions of
iterations of this EAGAIN loop, the printk is triggering:

....
[21052.533053] svc_recv: !has_wspace
[21052.533070] svc_recv: !has_wspace
[21052.533087] svc_recv: !has_wspace
[21052.533105] svc_recv: !has_wspace
[21052.533122] svc_recv: !has_wspace
[21052.533139] svc_recv: !has_wspace
[21052.533156] svc_recv: !has_wspace
[21052.533174] svc_recv: !has_wspace
[21052.533191] svc_recv: !has_wspace
[21052.533208] svc_recv: !has_wspace
[21052.533226] svc_recv: !has_wspace
[21052.533244] svc_recv: !has_wspace
[21052.533265] calling svc_recv: 1228163 times (err=-4)
[21052.533403] calling svc_recv: 1226616 times (err=-4)
[21052.534520] nfsd: last server has exited, flushing export cache

(I stopped nfsd since it was flooding the log).

I can only guess that before that commit we always had space;
now we don't anymore, and we are looping like crazy.

Thanks,

/mjt

2012-08-17 19:18:14

by J. Bruce Fields

Subject: Re: 3.0+ NFS issues (bisected)

On Fri, Aug 17, 2012 at 09:29:40PM +0400, Michael Tokarev wrote:
> On 17.08.2012 21:26, Michael Tokarev wrote:
> > On 17.08.2012 21:18, J. Bruce Fields wrote:
> >> On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote:
> > []
> >>> So we're calling svc_recv in a tight loop, eating
> >>> all available CPU. (The above is with just 2 nfsd
> >>> threads).
> >>>
> >>> Something is definitely wrong here. And it happens mure more
> >>> often after the mentioned commit (f03d78db65085).
> >>
> >> Oh, neat. Hm. That commit doesn't really sound like the cause, then.
> >> Is that busy-looping reproduceable on kernels before that commit?
> >
> > Note I bisected this issue to this commit. I haven't seen it
> > happening before this commit, and reverting it from 3.0 or 3.2
> > kernel makes the problem to go away.
> >
> > I guess it is looping there:
> >
> >
> > net/sunrpc/svc_xprt.c:svc_recv()
> > ...
> > len = 0;
> > ...
> > if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
> > ...
> > } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) { <=== here -- has no wspace due to memory...
> > ... len = <something>
> > }
> >
> > /* No data, incomplete (TCP) read, or accept() */
> > if (len == 0 || len == -EAGAIN)
> > goto out;
> > ...
> > out:
> > rqstp->rq_res.len = 0;
> > svc_xprt_release(rqstp);
> > return -EAGAIN;
> > }
> >
> > I'm trying to verify this theory...
>
> Yes. I inserted a printk there, and all these million times while
> we're waiting in this EAGAIN loop, this printk is triggering:
>
> ....
> [21052.533053] svc_recv: !has_wspace
> [21052.533070] svc_recv: !has_wspace
> [21052.533087] svc_recv: !has_wspace
> [21052.533105] svc_recv: !has_wspace
> [21052.533122] svc_recv: !has_wspace
> [21052.533139] svc_recv: !has_wspace
> [21052.533156] svc_recv: !has_wspace
> [21052.533174] svc_recv: !has_wspace
> [21052.533191] svc_recv: !has_wspace
> [21052.533208] svc_recv: !has_wspace
> [21052.533226] svc_recv: !has_wspace
> [21052.533244] svc_recv: !has_wspace
> [21052.533265] calling svc_recv: 1228163 times (err=-4)
> [21052.533403] calling svc_recv: 1226616 times (err=-4)
> [21052.534520] nfsd: last server has exited, flushing export cache
>
> (I stopped nfsd since it was flooding the log).
>
> I can only guess that before that commit, we always had space,
> now we don't anymore, and are looping like crazy.

Thanks! But, arrgh--that should be enough to go on at this point, but
I'm not seeing it. If has_wspace is returning false then it's likely
also returning false to the call at the start of svc_xprt_enqueue() (see
svc_xprt_has_something_to_do), which means the xprt shouldn't be getting
requeued and the next svc_recv call should find no socket ready (so
svc_xprt_dequeue() returns NULL), and go to sleep.

But clearly it's not working that way....

--b.

2012-08-17 20:08:19

by J. Bruce Fields

Subject: Re: 3.0+ NFS issues (bisected)

On Fri, Aug 17, 2012 at 03:18:00PM -0400, J. Bruce Fields wrote:
> On Fri, Aug 17, 2012 at 09:29:40PM +0400, Michael Tokarev wrote:
> > On 17.08.2012 21:26, Michael Tokarev wrote:
> > > On 17.08.2012 21:18, J. Bruce Fields wrote:
> > >> On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote:
> > > []
> > >>> So we're calling svc_recv in a tight loop, eating
> > >>> all available CPU. (The above is with just 2 nfsd
> > >>> threads).
> > >>>
> > >>> Something is definitely wrong here. And it happens mure more
> > >>> often after the mentioned commit (f03d78db65085).
> > >>
> > >> Oh, neat. Hm. That commit doesn't really sound like the cause, then.
> > >> Is that busy-looping reproduceable on kernels before that commit?
> > >
> > > Note I bisected this issue to this commit. I haven't seen it
> > > happening before this commit, and reverting it from 3.0 or 3.2
> > > kernel makes the problem to go away.
> > >
> > > I guess it is looping there:
> > >
> > >
> > > net/sunrpc/svc_xprt.c:svc_recv()
> > > ...
> > > len = 0;
> > > ...
> > > if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
> > > ...
> > > } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) { <=== here -- has no wspace due to memory...
> > > ... len = <something>
> > > }
> > >
> > > /* No data, incomplete (TCP) read, or accept() */
> > > if (len == 0 || len == -EAGAIN)
> > > goto out;
> > > ...
> > > out:
> > > rqstp->rq_res.len = 0;
> > > svc_xprt_release(rqstp);
> > > return -EAGAIN;
> > > }
> > >
> > > I'm trying to verify this theory...
> >
> > Yes. I inserted a printk there, and all these million times while
> > we're waiting in this EAGAIN loop, this printk is triggering:
> >
> > ....
> > [21052.533053] svc_recv: !has_wspace
> > [21052.533070] svc_recv: !has_wspace
> > [21052.533087] svc_recv: !has_wspace
> > [21052.533105] svc_recv: !has_wspace
> > [21052.533122] svc_recv: !has_wspace
> > [21052.533139] svc_recv: !has_wspace
> > [21052.533156] svc_recv: !has_wspace
> > [21052.533174] svc_recv: !has_wspace
> > [21052.533191] svc_recv: !has_wspace
> > [21052.533208] svc_recv: !has_wspace
> > [21052.533226] svc_recv: !has_wspace
> > [21052.533244] svc_recv: !has_wspace
> > [21052.533265] calling svc_recv: 1228163 times (err=-4)
> > [21052.533403] calling svc_recv: 1226616 times (err=-4)
> > [21052.534520] nfsd: last server has exited, flushing export cache
> >
> > (I stopped nfsd since it was flooding the log).
> >
> > I can only guess that before that commit, we always had space,
> > now we don't anymore, and are looping like crazy.
>
> Thanks! But, arrgh--that should be enough to go on at this point, but
> I'm not seeing it. If has_wspace is returning false then it's likely
> also returning false to the call at the start of svc_xprt_enqueue()

Wait a minute, that assumption's a problem because that calculation
depends in part on xpt_reserved, which is changed here....

In particular, svc_xprt_release() calls svc_reserve(rqstp, 0), which
subtracts rqstp->rq_reserved and then calls svc_xprt_enqueue, now with a
lower xpt_reserved value. That could well explain this.
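
In other words, the reservation is added before the write-space test and
handed back on the error path, so the test made at requeue time and the test
made at receive time disagree by exactly one sv_max_mesg worth of space. A
standalone toy model of that ping-pong, with invented numbers:

#include <stdio.h>

/* Toy model of the ordering problem: reserve sv_max_mesg before testing
 * write space, give it back when the test fails, and re-run the test the
 * way svc_xprt_enqueue() would; that check now passes, so the transport
 * is requeued and the thread spins. */
static long wspace = 1000;		/* free send-buffer space */
static long xpt_reserved;		/* space promised to pending replies */
static const long sv_max_mesg = 800;	/* one worst-case reply */

static int has_wspace(void)
{
	return wspace >= xpt_reserved + sv_max_mesg;
}

int main(void)
{
	int i;

	for (i = 0; i < 3; i++) {
		xpt_reserved += sv_max_mesg;		/* reserve first... */
		if (!has_wspace()) {
			xpt_reserved -= sv_max_mesg;	/* ...then give it back */
			printf("pass %d: recv-time check fails, requeue-time check says %s\n",
			       i, has_wspace() ? "go again" : "sleep");
		}
	}
	return 0;
}

Making the reservation only after a successful receive, as the patch in the
following message does, lets both tests see the same xpt_reserved value.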

--b.

> (see
> svc_xprt_has_something_to_do), which means the xprt shouldn't be getting
> requeued and the next svc_recv call should find no socket ready (so
> svc_xprt_dequeue() returns NULL), and goes to sleep.
>
> But clearly it's not working that way....
>
> --b.

2012-08-17 22:33:04

by J. Bruce Fields

Subject: Re: 3.0+ NFS issues (bisected)

On Fri, Aug 17, 2012 at 04:08:07PM -0400, J. Bruce Fields wrote:
> Wait a minute, that assumption's a problem because that calculation
> depends in part on xpt_reserved, which is changed here....
>
> In particular, svc_xprt_release() calls svc_reserve(rqstp, 0), which
> subtracts rqstp->rq_reserved and then calls svc_xprt_enqueue, now with a
> lower xpt_reserved value. That could well explain this.

So, maybe something like this?

--b.

commit c8136c319ad85d0db870021fc3f9074d37f26d4a
Author: J. Bruce Fields <[email protected]>
Date: Fri Aug 17 17:31:53 2012 -0400

svcrpc: don't add to xpt_reserved till we receive

The rpc server tries to ensure that there will be room to send a reply
before it receives a request.

It does this by tracking, in xpt_reserved, an upper bound on the total
size of the replies that it has already committed to for the socket.

Currently it is adding in the estimate for a new reply *before* it
checks whether there is space available. If it finds that there is not
space, it then subtracts the estimate back out.

This may lead the subsequent svc_xprt_enqueue to decide that there is
space after all.

The result is an svc_recv() that will repeatedly return -EAGAIN, causing
server threads to loop without doing any actual work.

Reported-by: Michael Tokarev <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index ec99849a..59ff3a3 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -366,8 +366,6 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
 				rqstp, rqstp->rq_xprt);
 		rqstp->rq_xprt = xprt;
 		svc_xprt_get(xprt);
-		rqstp->rq_reserved = serv->sv_max_mesg;
-		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
 		pool->sp_stats.threads_woken++;
 		wake_up(&rqstp->rq_wait);
 	} else {
@@ -644,8 +642,6 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 	if (xprt) {
 		rqstp->rq_xprt = xprt;
 		svc_xprt_get(xprt);
-		rqstp->rq_reserved = serv->sv_max_mesg;
-		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);

 		/* As there is a shortage of threads and this request
 		 * had to be queued, don't allow the thread to wait so
@@ -743,6 +739,10 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 		len = xprt->xpt_ops->xpo_recvfrom(rqstp);
 		dprintk("svc: got len=%d\n", len);
 	}
+	if (len > 0) {
+		rqstp->rq_reserved = serv->sv_max_mesg;
+		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
+	}
 	svc_xprt_received(xprt);

 	/* No data, incomplete (TCP) read, or accept() */

2012-08-18 06:49:44

by Michael Tokarev

Subject: Re: 3.0+ NFS issues (bisected)

On 18.08.2012 02:32, J. Bruce Fields wrote:
> On Fri, Aug 17, 2012 at 04:08:07PM -0400, J. Bruce Fields wrote:
>> Wait a minute, that assumption's a problem because that calculation
>> depends in part on xpt_reserved, which is changed here....
>>
>> In particular, svc_xprt_release() calls svc_reserve(rqstp, 0), which
>> subtracts rqstp->rq_reserved and then calls svc_xprt_enqueue, now with a
>> lower xpt_reserved value. That could well explain this.
>
> So, maybe something like this?

Well. What can I say? With the change below applied (to a 3.2 kernel
at least), I don't see any stalls or high CPU usage on the server
anymore. It survived several multi-gigabyte transfers, over several
hours, without any problem. So it is a good step forward ;)

But the whole thing seems quite fragile. I tried to follow the logic
in there, and it is, well, "twisted" and somewhat difficult to follow.
So I don't know whether this is the right fix or not. At least it
works! :)

And I really wonder why no one else reported this problem before.
Am I the only one in this world who uses Linux nfsd? :)

Thank you for all your patience and the proposed fix!

/mjt

> commit c8136c319ad85d0db870021fc3f9074d37f26d4a
> Author: J. Bruce Fields <[email protected]>
> Date: Fri Aug 17 17:31:53 2012 -0400
>
> svcrpc: don't add to xpt_reserved till we receive
>
> The rpc server tries to ensure that there will be room to send a reply
> before it receives a request.
>
> It does this by tracking, in xpt_reserved, an upper bound on the total
> size of the replies that is has already committed to for the socket.
>
> Currently it is adding in the estimate for a new reply *before* it
> checks whether there is space available. If it finds that there is not
> space, it then subtracts the estimate back out.
>
> This may lead the subsequent svc_xprt_enqueue to decide that there is
> space after all.
>
> The results is a svc_recv() that will repeatedly return -EAGAIN, causing
> server threads to loop without doing any actual work.
>
> Reported-by: Michael Tokarev <[email protected]>
> Signed-off-by: J. Bruce Fields <[email protected]>
>
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index ec99849a..59ff3a3 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -366,8 +366,6 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
> rqstp, rqstp->rq_xprt);
> rqstp->rq_xprt = xprt;
> svc_xprt_get(xprt);
> - rqstp->rq_reserved = serv->sv_max_mesg;
> - atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
> pool->sp_stats.threads_woken++;
> wake_up(&rqstp->rq_wait);
> } else {
> @@ -644,8 +642,6 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
> if (xprt) {
> rqstp->rq_xprt = xprt;
> svc_xprt_get(xprt);
> - rqstp->rq_reserved = serv->sv_max_mesg;
> - atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
>
> /* As there is a shortage of threads and this request
> * had to be queued, don't allow the thread to wait so
> @@ -743,6 +739,10 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
> len = xprt->xpt_ops->xpo_recvfrom(rqstp);
> dprintk("svc: got len=%d\n", len);
> }
> + if (len > 0) {
> + rqstp->rq_reserved = serv->sv_max_mesg;
> + atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
> + }
> svc_xprt_received(xprt);
>
> /* No data, incomplete (TCP) read, or accept() */

2012-08-18 11:13:59

by J. Bruce Fields

Subject: Re: 3.0+ NFS issues (bisected)

On Sat, Aug 18, 2012 at 10:49:31AM +0400, Michael Tokarev wrote:
> On 18.08.2012 02:32, J. Bruce Fields wrote:
> > On Fri, Aug 17, 2012 at 04:08:07PM -0400, J. Bruce Fields wrote:
> >> Wait a minute, that assumption's a problem because that calculation
> >> depends in part on xpt_reserved, which is changed here....
> >>
> >> In particular, svc_xprt_release() calls svc_reserve(rqstp, 0), which
> >> subtracts rqstp->rq_reserved and then calls svc_xprt_enqueue, now with a
> >> lower xpt_reserved value. That could well explain this.
> >
> > So, maybe something like this?
>
> Well. What can I say? With the change below applied (to 3.2 kernel
> at least), I don't see any stalls or high CPU usage on the server
> anymore. It survived several multi-gigabyte transfers, for several
> hours, without any problem. So it is a good step forward ;)
>
> But the whole thing seems to be quite a bit fragile. I tried to follow
> the logic in there, and the thing is quite a bit, well, "twisted", and
> somewhat difficult to follow. So I don't know if this is the right
> fix or not. At least it works! :)

Suggestions welcomed.

> And I really wonder why no one else reported this problem before.
> Is me the only one in this world who uses linux nfsd? :)

This, for example:

http://marc.info/?l=linux-nfs&m=134131915612287&w=2

may well describe the same problem.... It just needed some debugging
persistence, thanks!

--b.

2012-08-18 12:58:36

by Michael Tokarev

Subject: Re: 3.0+ NFS issues (bisected)

On 18.08.2012 15:13, J. Bruce Fields wrote:
> On Sat, Aug 18, 2012 at 10:49:31AM +0400, Michael Tokarev wrote:
[]
>> Well. What can I say? With the change below applied (to 3.2 kernel
>> at least), I don't see any stalls or high CPU usage on the server
>> anymore. It survived several multi-gigabyte transfers, for several
>> hours, without any problem. So it is a good step forward ;)
>>
>> But the whole thing seems to be quite a bit fragile. I tried to follow
>> the logic in there, and the thing is quite a bit, well, "twisted", and
>> somewhat difficult to follow. So I don't know if this is the right
>> fix or not. At least it works! :)
>
> Suggestions welcomed.

Ok...

Meanwhile, you can add my
Tested-By: Michael Tokarev <[email protected]>

to the patch.

>> And I really wonder why no one else reported this problem before.
>> Is me the only one in this world who uses linux nfsd? :)
>
> This, for example:
>
> http://marc.info/?l=linux-nfs&m=134131915612287&w=2
>
> may well describe the same problem.... It just needed some debugging
> persistence, thanks!

Ah. I tried to find something when I initially
sent this report, but wasn't able to. Apparently
I'm not alone with this problem after all!

Thank you for all the work!

/mjt