2012-05-25 06:53:16

by Michael Tokarev

Subject: 3.0+ NFS issues

Hello.

I updated my NFS server machine to kernel 3.0, and
noticed that its main usage has become, well, problematic.

While trying to dig deeper, I also found a few other
interesting issues, which are mentioned below.

But first things first: NFS.

i686pae kernel, lots of RAM, Atom-based (Cedar Trail)
machine with the usual rtl8169 NIC. 3.0 or 3.2 kernel
(I will try the current 3.4 but I don't have much hope
there). NFSv4.

When a client machine (also running a 3.0 kernel) does some
reading, the process often stalls somewhere in the read
syscall, or, more rarely, during close, for up to two MINUTES.
During this time the client (kernel) reports "NFS server <foo>
does not respond" several times, and finally "NFS server
<foo> ok"; the client process "unsticks" from the read(2)
and is able to perform a few more reads until the whole
thing repeats.

While this happens, i.e. while the client process is stuck,
any access to the mounted filesystem results in an
uninterruptible sleep - all processes, including the
initial client, are unkillable. This is very annoying,
and I have never been able to fix it, including in the real
case when the server becomes unreachable - the "intr" mount
option, which is supposed to help here, makes no difference.
This is an old, longstanding issue of its own, but it is not
really related to the issue at hand.

On the server side things are more interesting. During
the client stall, all NFSD threads are busy running,
using all available CPU. The machine has 2 cores, and
if I reduce the number of NFSD threads to 1, one core is
used by the single NFSD thread at 100%. With the default
8 NFSDs, both cores are used, by 4 NFSD threads each.
This is according to top(1), which shows proportional
and equal CPU activity for each NFSD thread.
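
For reference, the thread count can be varied with something
like this (just a sketch - the exact invocation depends on the
distro's init scripts; rpc.nfsd is used directly here):

  # current number of nfsd threads
  cat /proc/fs/nfsd/threads
  # run with a single thread
  rpc.nfsd 1
  # back to the default
  rpc.nfsd 8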

But, and this is where things become more interesting:
perf top shows almost no activity with any number of NFSD
threads other than one. For example, this is a typical perf
top display with 2 NFSD threads:

 samples    pcnt  function             DSO
 1558.00    6.4%  svc_recv             [sunrpc]
 1084.00    5.2%  __schedule           [kernel]
 1065.00    5.1%  select_task_rq_fair  [kernel]
  692.00    3.3%  try_to_wake_up       [kernel]


Yes, there's some 6.4% of time spent in svc_recv, but that
is FAR from the 200% CPU usage reported by top(1). Now, when
I reduce the number of NFSD threads to 1, the perf top output
becomes dramatically different:

 samples    pcnt  function            DSO
 5996.00   59.2%  svc_recv            [sunrpc]
  750.00    7.4%  cache_check         [sunrpc]
  519.00    5.1%  svc_tcp_has_wspace  [sunrpc]
  465.00    4.6%  local_bh_enable_ip  [kernel.kallsyms]
  301.00    3.0%  svc_defer           [sunrpc]
  294.00    2.9%  page_address        [kernel.kallsyms]

I.e., svc_recv time usage jumps from 6% to 60% - a 10x
difference.

So, apparently, the `perf top' tool does not show the whole truth.
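
As an aside, a recorded profile with call graphs might show
more than the live top view; something along these lines could
be used to cross-check (plain perf, nothing NFS-specific
assumed):

  # system-wide profile with call graphs, 30 seconds
  perf record -a -g -- sleep 30
  perf report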

Trying to catch this high CPU usage with Sysrq+T does not
show anything interesting: the NFSD thread is running (and
is the only thread shown in the "Runnable tasks" output):


May 25 10:48:37 gnome vmunix: [ 1709.483771] nfsd R running 0 5853 2 0x00000000
May 25 10:48:37 gnome vmunix: [ 1709.483796] f4477030 00000046 00000000 5c8e8929 0000018e f4c700f0 f4043a70 f545b140
May 25 10:48:37 gnome vmunix: [ 1709.483841] 050a8544 0000018e c1469140 c1469140 f4043a70 00000086 00000000 c1048575
May 25 10:48:37 gnome vmunix: [ 1709.483887] c101ba11 c12fac18 f31bf800 f8f7bd30 0000003a c12fb156 0000003a 00000000
May 25 10:48:37 gnome vmunix: [ 1709.483931] Call Trace:
May 25 10:48:37 gnome vmunix: [ 1709.483950] [<c1048575>] ? irq_exit+0x35/0x90
May 25 10:48:37 gnome vmunix: [ 1709.483973] [<c101ba11>] ? smp_apic_timer_interrupt+0x51/0x90
May 25 10:48:37 gnome vmunix: [ 1709.483997] [<c12fac18>] ? _raw_spin_lock_bh+0x8/0x30
May 25 10:48:37 gnome vmunix: [ 1709.484029] [<f8f7bd30>] ? svc_xprt_enqueue+0x40/0x1e0 [sunrpc]
May 25 10:48:37 gnome vmunix: [ 1709.484053] [<c12fb156>] ? apic_timer_interrupt+0x2a/0x30
May 25 10:48:37 gnome vmunix: [ 1709.484084] [<f8f7007b>] ? svc_destroy+0xdb/0x110 [sunrpc]
May 25 10:48:37 gnome vmunix: [ 1709.484107] [<c103d3e3>] ? __cond_resched+0x13/0x30
May 25 10:48:37 gnome vmunix: [ 1709.484129] [<c12f92b5>] ? _cond_resched+0x25/0x30
May 25 10:48:37 gnome vmunix: [ 1709.484159] [<f8f7cb5d>] ? svc_recv+0x1cd/0x7e0 [sunrpc]
May 25 10:48:37 gnome vmunix: [ 1709.484182] [<c101ba11>] ? smp_apic_timer_interrupt+0x51/0x90
May 25 10:48:37 gnome vmunix: [ 1709.484206] [<c103a3d0>] ? try_to_wake_up+0x200/0x200
May 25 10:48:37 gnome vmunix: [ 1709.484233] [<f90ea08c>] ? nfsd+0x8c/0x120 [nfsd]
May 25 10:48:37 gnome vmunix: [ 1709.484255] [<c102d240>] ? complete+0x40/0x60
May 25 10:48:37 gnome vmunix: [ 1709.484279] [<f90ea000>] ? 0xf90e9fff
May 25 10:48:37 gnome vmunix: [ 1709.484299] [<c105ea99>] ? kthread+0x69/0x70
May 25 10:48:37 gnome vmunix: [ 1709.484321] [<c105ea30>] ? kthread_worker_fn+0x150/0x150
May 25 10:48:37 gnome vmunix: [ 1709.484345] [<c1301876>] ? kernel_thread_helper+0x6/0xd

May 25 10:49:37 gnome vmunix: [ 1769.982791] nfsd R running 0 5853 2 0x00000000
May 25 10:49:37 gnome vmunix: [ 1769.982816] f5449f80 00000000 00000000 7636b581 0000019c 00000003 7603dcb8 0000019c
May 25 10:49:37 gnome vmunix: [ 1769.982859] 7603dcb8 0000019c 00000046 00000000 f4043a70 00000086 00000000 c1048575
May 25 10:49:37 gnome vmunix: [ 1769.982903] f31bf800 c12fac18 f31bf800 f8f7bd30 00000047 f31bf800 f418c2c0 f8f7c03c
May 25 10:49:37 gnome vmunix: [ 1769.982948] Call Trace:
May 25 10:49:37 gnome vmunix: [ 1769.982966] [<c1048575>] ? irq_exit+0x35/0x90
May 25 10:49:37 gnome vmunix: [ 1769.982987] [<c12fac18>] ? _raw_spin_lock_bh+0x8/0x30
May 25 10:49:37 gnome vmunix: [ 1769.983019] [<f8f7bd30>] ? svc_xprt_enqueue+0x40/0x1e0 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983050] [<f8f7c03c>] ? svc_xprt_release+0x1c/0xd0 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983081] [<f8f7c0ad>] ? svc_xprt_release+0x8d/0xd0 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983112] [<f8f7cc03>] ? svc_recv+0x273/0x7e0 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983143] [<f8f7c300>] ? svc_age_temp_xprts+0x110/0x190 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983166] [<c103a3d0>] ? try_to_wake_up+0x200/0x200
May 25 10:49:37 gnome vmunix: [ 1769.983193] [<f90ea08c>] ? nfsd+0x8c/0x120 [nfsd]
May 25 10:49:37 gnome vmunix: [ 1769.983215] [<c102d240>] ? complete+0x40/0x60
May 25 10:49:37 gnome vmunix: [ 1769.983238] [<f90ea000>] ? 0xf90e9fff
May 25 10:49:37 gnome vmunix: [ 1769.983258] [<c105ea99>] ? kthread+0x69/0x70
May 25 10:49:37 gnome vmunix: [ 1769.983278] [<c105ea30>] ? kthread_worker_fn+0x150/0x150
May 25 10:49:37 gnome vmunix: [ 1769.983301] [<c1301876>] ? kernel_thread_helper+0x6/0xd

This is on 3.2; on 3.0 it was in a different place.

This problem is persistent and always reproducible. It happens
on reads from the server (I haven't seen it with writes),
and completely goes away when reverting to the kernel I
previously used on this machine - 2.6.32. Unfortunately,
2.6.32 does not quite work on the new hardware (I replaced
the motherboard), due to missing drivers for several components.
(That is a different story - I think NFSD should work in 3.0
regardless of 2.6.32 issues. :)

Anything I can do to make NFS work again?

Thanks,

/mjt


2012-05-29 15:24:19

by J. Bruce Fields

Subject: Re: 3.0+ NFS issues

On Fri, May 25, 2012 at 10:53:11AM +0400, Michael Tokarev wrote:
> I updated my nfs server machine to kernel 3.0, and
> noticed that its main usage become, well, problematic.
>
> While trying to dig deeper, I also found a few other
> interesting issues, which are mentioned below.
>
> But first thing first: nfs.
>
> i686pae kernel, lots of RAM, Atom-based (cedar trail)
> machine with usual rtl8169 NIC. 3.0 or 3.2 kernel
> (I will try current 3.4 but I don't have much hopes
> there). NFSv4.
>
> When a client machine (also 3.0 kernel) does some reading,
> the process often stalls somewhere in the read syscall,
> or, rarer, during close, for up to two MINUTES. During
> this time, the client (kernel) reports "NFS server <foo>
> does not respond" several times, and finally "NFS server
> <foo> ok", client process "unstucks" from the read(2),
> and is able to perform a few more reads till the whole
> thing repeats.

You say 2.6.32 was OK; have you tried anything else between?

And you're holding the client constant while varying only the server
version, right?

Is your network otherwise working? (E.g. does transferring a bunch of
data from server to client using some other protocol work reliably?)
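
(For instance, a rough sketch - the tool choice and host name
here are only assumptions:

  # on the server
  iperf -s
  # on the client: push ~1GB over a plain TCP stream and watch for stalls
  iperf -c <server> -n 1000M

A plain scp of a large file would do just as well.)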

Is there anything happening on the network during these stalls? (You
can watch the network with wireshark, for example.)

Does NFSv3 behave the same way?
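
(For a quick comparison, a v3 mount could be forced with
something like this - the export path is assumed:

  mount -t nfs -o vers=3 <server>:/export /mnt/test
)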

--b.


2012-05-30 07:11:46

by Michael Tokarev

Subject: Re: 3.0+ NFS issues

On 29.05.2012 19:24, J. Bruce Fields wrote:
> On Fri, May 25, 2012 at 10:53:11AM +0400, Michael Tokarev wrote:
>> I updated my nfs server machine to kernel 3.0, and
>> noticed that its main usage become, well, problematic.
>>
>> While trying to dig deeper, I also found a few other
>> interesting issues, which are mentioned below.
>>
>> But first thing first: nfs.
>>
>> i686pae kernel, lots of RAM, Atom-based (cedar trail)
>> machine with usual rtl8169 NIC. 3.0 or 3.2 kernel
>> (I will try current 3.4 but I don't have much hopes
>> there). NFSv4.
>>
>> When a client machine (also 3.0 kernel) does some reading,
>> the process often stalls somewhere in the read syscall,
>> or, rarer, during close, for up to two MINUTES. During
>> this time, the client (kernel) reports "NFS server <foo>
>> does not respond" several times, and finally "NFS server
>> <foo> ok", client process "unstucks" from the read(2),
>> and is able to perform a few more reads till the whole
>> thing repeats.
>
> You say 2.6.32 was OK; have you tried anything else between?

Well, I thought bisecting between 2.6.32 and 3.0 would be quite
painful... But I'll try it if nothing else helps. And no, I haven't
tried anything in between.

> And you're holding the client constant while varying only the server
> version, right?

Yes.

> Is your network otherwise working? (E.g. does transferring a bunch of
> data from server to client using some other protocol work reliably?)

Yes, it works flawlessly; all other protocols work so far.

To date, I have resorted to using a small webserver plus wget as an
ugly workaround for the problem - http works for reads from the
server, while nfs works for writes.
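
Roughly, that workaround looks like this (an illustrative sketch
only - the port and paths are made up):

  # on the server: export the directory read-only over plain http
  cd /srv/export && python -m SimpleHTTPServer 8080
  # on the client
  wget -r -np http://<server>:8080/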

> Is there anything happening on the network during these stalls? (You
> can watch the network with wireshark, for example.)

The network load is irrelevant - it behaves the same way with a
100% idle network or with the network busy doing other stuff.

> Does NFSv3 behave the same way?

Yes, it does. With all NFSDs on the server eating all available
CPU for quite some time, and with them being "ghosts" to perf top.

And with the client being unkillable again.

Can at least the client be made interruptible? Mounting with
-o intr,soft makes no visible difference...

Thanks,

/mjt

2012-05-30 13:25:23

by J. Bruce Fields

Subject: Re: 3.0+ NFS issues

On Wed, May 30, 2012 at 11:11:42AM +0400, Michael Tokarev wrote:
> On 29.05.2012 19:24, J. Bruce Fields wrote:
> > On Fri, May 25, 2012 at 10:53:11AM +0400, Michael Tokarev wrote:
> >> I updated my nfs server machine to kernel 3.0, and
> >> noticed that its main usage become, well, problematic.
> >>
> >> While trying to dig deeper, I also found a few other
> >> interesting issues, which are mentioned below.
> >>
> >> But first thing first: nfs.
> >>
> >> i686pae kernel, lots of RAM, Atom-based (cedar trail)
> >> machine with usual rtl8169 NIC. 3.0 or 3.2 kernel
> >> (I will try current 3.4 but I don't have much hopes
> >> there). NFSv4.
> >>
> >> When a client machine (also 3.0 kernel) does some reading,
> >> the process often stalls somewhere in the read syscall,
> >> or, rarer, during close, for up to two MINUTES. During
> >> this time, the client (kernel) reports "NFS server <foo>
> >> does not respond" several times, and finally "NFS server
> >> <foo> ok", client process "unstucks" from the read(2),
> >> and is able to perform a few more reads till the whole
> >> thing repeats.
> >
> > You say 2.6.32 was OK; have you tried anything else between?
>
> Well, I thought bisecting between 2.6.32 and 3.0 will be quite
> painful... But I'll try if nothing else helps. And no, I haven't
> tried anything in-between.
>
> > And you're holding the client constant while varying only the server
> > version, right?
>
> Yes.
>
> > Is your network otherwise working? (E.g. does transferring a bunch of
> > data from server to client using some other protocol work reliably?)
>
> Yes, it works flawlessly, all other protocols works so far.
>
> To the date, I resorted to using a small webserver plus wget as an ugly
> workaround for the problem - http works for reads from the server, while
> nfs works for writes.
>
> > Is there anything happening on the network during these stalls? (You
> > can watch the network with wireshark, for example.)
>
> The network load is irrelevant - it behaves the same way with
> 100% idle network or with network busy doing other stuff.

That's not what I meant. During one of these read stalls, if you watch
the network with wireshark, do you see any NFS traffic between the
client and server?

Also: do you have a reliable way of reproducing this quickly?

--b.


2012-05-31 06:47:14

by Michael Tokarev

Subject: Re: 3.0+ NFS issues

On 30.05.2012 17:25, J. Bruce Fields wrote:
> On Wed, May 30, 2012 at 11:11:42AM +0400, Michael Tokarev wrote:
[]
> That's not what I meant. During one of these read stalls, if you watch
> the network with wireshark, do you see any NFS traffic between the
> client and server?

Oh. Indeed, I misunderstood.

And no, during these stalls, there's no network activity at all.
Here's the typical scenario:

...
10:38:53.781990 IP (tos 0x0, ttl 64, id 35131, offset 0, flags [DF], proto TCP (6), length 52)
192.168.88.2.880 > 192.168.88.4.2049: Flags [.], cksum 0x317e (incorrect -> 0xb43d), ack 89530281, win 23173, options [nop,nop,TS val 3298129 ecr 122195208], length 0
10:38:53.782000 IP (tos 0x0, ttl 64, id 6329, offset 0, flags [DF], proto TCP (6), length 1500)
192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0xe827 (correct), seq 89530281:89531729, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 1448
10:38:53.782027 IP (tos 0x0, ttl 64, id 6330, offset 0, flags [DF], proto TCP (6), length 1708)
192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0x37f6 (incorrect -> 0x6790), seq 89531729:89533385, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 1656
10:38:53.782029 IP (tos 0x0, ttl 64, id 35132, offset 0, flags [DF], proto TCP (6), length 52)
192.168.88.2.880 > 192.168.88.4.2049: Flags [.], cksum 0x317e (incorrect -> 0xa81d), ack 89533385, win 23173, options [nop,nop,TS val 3298129 ecr 122195208], length 0
10:38:53.782040 IP (tos 0x0, ttl 64, id 6333, offset 0, flags [DF], proto TCP (6), length 1500)
192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0x0d5b (correct), seq 89534833:89536281, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 1448
10:38:53.782082 IP (tos 0x0, ttl 64, id 6334, offset 0, flags [DF], proto TCP (6), length 4396)
192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0x4276 (incorrect -> 0x778a), seq 89536281:89540625, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 4344
10:38:53.782088 IP (tos 0x0, ttl 64, id 35134, offset 0, flags [DF], proto TCP (6), length 52)
192.168.88.2.880 > 192.168.88.4.2049: Flags [.], cksum 0x317e (incorrect -> 0x8bd5), ack 89540625, win 23173, options [nop,nop,TS val 3298129 ecr 122195208], length 0
10:38:53.782096 IP (tos 0x0, ttl 64, id 6337, offset 0, flags [DF], proto TCP (6), length 1500)
192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0x835d (correct), seq 89540625:89542073, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 1448
10:38:53.827355 IP (tos 0x0, ttl 64, id 35160, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1396548098 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827379 IP (tos 0x0, ttl 64, id 35161, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1413325314 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827385 IP (tos 0x0, ttl 64, id 35162, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1430102530 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827400 IP (tos 0x0, ttl 64, id 35163, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1446879746 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827406 IP (tos 0x0, ttl 64, id 35164, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1463656962 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827409 IP (tos 0x0, ttl 64, id 35165, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1480434178 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827413 IP (tos 0x0, ttl 64, id 35166, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1497211394 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827417 IP (tos 0x0, ttl 64, id 35167, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1513988610 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827420 IP (tos 0x0, ttl 64, id 35168, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1530765826 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827424 IP (tos 0x0, ttl 64, id 35169, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1547543042 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827427 IP (tos 0x0, ttl 64, id 35170, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1564320258 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827434 IP (tos 0x0, ttl 64, id 35171, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1581097474 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827438 IP (tos 0x0, ttl 64, id 35172, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1597874690 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827443 IP (tos 0x0, ttl 64, id 35173, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1614651906 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827447 IP (tos 0x0, ttl 64, id 35174, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1631429122 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827673 IP (tos 0x0, ttl 64, id 6428, offset 0, flags [DF], proto TCP (6), length 52)
192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0xe4e5 (correct), ack 41617, win 6289, options [nop,nop,TS val 122195221 ecr 3298142], length 0
10:38:53.827699 IP (tos 0x0, ttl 64, id 6429, offset 0, flags [DF], proto TCP (6), length 52)
192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0xdfd4 (correct), ack 42913, win 6289, options [nop,nop,TS val 122195221 ecr 3298143], length 0
10:38:53.865036 IP (tos 0x0, ttl 64, id 6430, offset 0, flags [DF], proto TCP (6), length 52)
192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0xdd40 (correct), ack 43561, win 6289, options [nop,nop,TS val 122195233 ecr 3298143], length 0
[pause]
^C

192.168.88.2 is the client, .4 is the server.

I'm not sure if the series of getattr requests from the client
comes right before or right after the beginning of the stall,
but after the last 3 replies from the server there's no other
activity for a long time, and the server is eating 100% of the
available CPU as I described previously.

> Also: do you have a reliable way of reproducing this quickly?

Yes, it is enough to start copying any large file, and within a
few seconds the first stall happens.
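
For example, something as simple as this triggers it within
seconds (the path is just an example):

  # read a large tree from the nfs mount and discard it
  tar cf - /mnt/nfs/bigdir > /dev/null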

Can you suggest something for the other part of the question:

>> Can at least the client be made interruptible? Mounting with
>> -o intr,soft makes no visible difference...

please? :)

Thank you!

/mjt

2012-05-31 12:59:34

by Myklebust, Trond

Subject: Re: 3.0+ NFS issues

On Thu, 2012-05-31 at 10:47 +0400, Michael Tokarev wrote:

> 10:38:53.827355 IP (tos 0x0, ttl 64, id 35160, offset 0, flags [DF], proto TCP (6), length 268)
> 192.168.88.2.1396548098 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827379 IP (tos 0x0, ttl 64, id 35161, offset 0, flags [DF], proto TCP (6), length 268)
> 192.168.88.2.1413325314 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827385 IP (tos 0x0, ttl 64, id 35162, offset 0, flags [DF], proto TCP (6), length 268)
> 192.168.88.2.1430102530 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827400 IP (tos 0x0, ttl 64, id 35163, offset 0, flags [DF], proto TCP (6), length 268)
> 192.168.88.2.1446879746 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827406 IP (tos 0x0, ttl 64, id 35164, offset 0, flags [DF], proto TCP (6), length 268)
> 192.168.88.2.1463656962 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827409 IP (tos 0x0, ttl 64, id 35165, offset 0, flags [DF], proto TCP (6), length 268)
> 192.168.88.2.1480434178 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827413 IP (tos 0x0, ttl 64, id 35166, offset 0, flags [DF], proto TCP (6), length 268)
> 192.168.88.2.1497211394 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827417 IP (tos 0x0, ttl 64, id 35167, offset 0, flags [DF], proto TCP (6), length 268)
> 192.168.88.2.1513988610 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827420 IP (tos 0x0, ttl 64, id 35168, offset 0, flags [DF], proto TCP (6), length 268)
> 192.168.88.2.1530765826 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827424 IP (tos 0x0, ttl 64, id 35169, offset 0, flags [DF], proto TCP (6), length 268)
> 192.168.88.2.1547543042 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827427 IP (tos 0x0, ttl 64, id 35170, offset 0, flags [DF], proto TCP (6), length 268)
> 192.168.88.2.1564320258 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827434 IP (tos 0x0, ttl 64, id 35171, offset 0, flags [DF], proto TCP (6), length 268)
> 192.168.88.2.1581097474 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827438 IP (tos 0x0, ttl 64, id 35172, offset 0, flags [DF], proto TCP (6), length 268)
> 192.168.88.2.1597874690 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827443 IP (tos 0x0, ttl 64, id 35173, offset 0, flags [DF], proto TCP (6), length 268)
> 192.168.88.2.1614651906 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827447 IP (tos 0x0, ttl 64, id 35174, offset 0, flags [DF], proto TCP (6), length 268)
> 192.168.88.2.1631429122 > 192.168.88.4.2049: 212 getattr fh 0,0/22

That is tcpdump trying to interpret your NFSv4 trace as NFSv2/v3.

Can you either please use wireshark to provide a full text dump (using
something like 'tshark -V -O nfs,rpc'), or just send us the binary
tcpdump output using 'tcpdump -w /tmp/foo -s 90000'?

> >> Can at least the client be made interruptible? Mounting with
> >> -o intr,soft makes no visible difference...
>
> please? :)

It already is interruptible: try 'kill -9' or any other fatal signal.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2012-05-31 13:25:04

by Michael Tokarev

Subject: Re: 3.0+ NFS issues

On 31.05.2012 16:59, Myklebust, Trond wrote:
[]
> That is tcpdump trying to interpret your NFSv4 trace as NFSv2/v3.

Oh.

> Can you either please use wireshark to provide a full text dump (using
> something like 'tshark -V -O nfs,rpc'), or just send us the binary
> tcpdump output using 'tcpdump -w /tmp/foo -s 90000'?

I started tcpdump:

tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump

on the client (192.168.88.2). Next I mounted a directory on the client,
and started reading (tar'ing) a directory into /dev/null. It captured a
few stalls. Tcpdump shows the number of packets it got; the stalls are
at packet counts 58090, 97069 and 97071. I cancelled the capture after that.
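
To zoom in on one of those spots in the dump, something along
these lines should work (the frame-number window is picked
arbitrarily around the first stall, and I'm assuming a tshark
where -R takes a display filter):

  tshark -r nfsdump -R "frame.number >= 58080 and frame.number <= 58100" -V -O nfs,rpc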

The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ;
it is 220Mb uncompressed and 1.3Mb compressed. The source files are
10 files of 1Gb each, all created with the `truncate' utility, so they
take no space on disk at all. This also makes it obvious that the issue
does not depend on the speed of the disk on the server (since in this
case the server disk isn't even in use).
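
The files themselves were made with something along these lines
(the exact names are assumed):

  for i in $(seq 0 9); do truncate -s 1G file$i; done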

What I noticed: with the default 8 NFSD threads it takes quite a
bit longer to hit this issue, but it still happens. The capture
was done using 2 threads.

>>>> Can at least the client be made interruptible? Mounting with
>>>> -o intr,soft makes no visible difference...
>>
>> please? :)
>
> It already is interruptible: try 'kill -9' or any other fatal signal.

Oh. Indeed, I can kill it from another terminal using -9. That is
great already!

Thank you!

/mjt

2012-05-31 13:46:34

by Myklebust, Trond

Subject: Re: 3.0+ NFS issues

On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
> On 31.05.2012 16:59, Myklebust, Trond wrote:
> []
> > That is tcpdump trying to interpret your NFSv4 trace as NFSv2/v3.
>
> Oh.
>
> > Can you either please use wireshark to provide a full text dump (using
> > something like 'tshark -V -O nfs,rpc'), or just send us the binary
> > tcpdump output using 'tcpdump -w /tmp/foo -s 90000'?
>
> I started tcpdump:
>
> tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
>
> on the client (192.168.88.2). Next I mounted a directory on the client,
> and started reading (tar'ing) a directory into /dev/null. It captured a
> few stalls. Tcpdump shows number of packets it got, the stalls are at
> packet counts 58090, 97069 and 97071. I cancelled the capture after that.
>
> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
> it is 220Mb uncompressed and 1.3Mb compressed. The source files are
> 10 files of 1Gb each, all made by using `truncate' utility, so does not
> take place on disk at all. This also makes it obvious that the issue
> does not depend on the speed of disk on the server (since in this case,
> the server disk isn't even in use).

OK. So from the above file it looks as if the traffic is mainly READ
requests.

In 2 places the server stops responding. In both cases, the client seems
to be sending a single TCP frame containing several COMPOUNDS containing
READ requests (which should be legal) just prior to the hang. When the
server doesn't respond, the client pings it with a RENEW, before it ends
up severing the TCP connection and then retransmitting.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2012-05-31 13:51:22

by Michael Tokarev

Subject: Re: 3.0+ NFS issues

On 31.05.2012 17:46, Myklebust, Trond wrote:
> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
[]
>> I started tcpdump:
>>
>> tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
>>
>> on the client (192.168.88.2). Next I mounted a directory on the client,
>> and started reading (tar'ing) a directory into /dev/null. It captured a
>> few stalls. Tcpdump shows number of packets it got, the stalls are at
>> packet counts 58090, 97069 and 97071. I cancelled the capture after that.
>>
>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
>> it is 220Mb uncompressed and 1.3Mb compressed. The source files are
>> 10 files of 1Gb each, all made by using `truncate' utility, so does not
>> take place on disk at all. This also makes it obvious that the issue
>> does not depend on the speed of disk on the server (since in this case,
>> the server disk isn't even in use).
>
> OK. So from the above file it looks as if the traffic is mainly READ
> requests.

The issue here happens only with reads.

> In 2 places the server stops responding. In both cases, the client seems
> to be sending a single TCP frame containing several COMPOUNDS containing
> READ requests (which should be legal) just prior to the hang. When the
> server doesn't respond, the client pings it with a RENEW, before it ends
> up severing the TCP connection and then retransmitting.

And sometimes -- speaking only from the behaviour I've seen, not from
the actual frames sent -- the server does not respond to the RENEW
either, in which case the client reports "nfs server not responding",
and on the next renew it may actually respond. This happens too, but
much more rarely.

During these stalls, i.e. when there's no network activity at all,
the server's NFSD threads are busy eating all available CPU.

What does it all tell us? :)

Thank you!

/mjt