2012-05-25 06:53:14

by Michael Tokarev

Subject: 3.0+ NFS issues

Hello.

I updated my nfs server machine to kernel 3.0, and
noticed that its main usage became, well, problematic.

While trying to dig deeper, I also found a few other
interesting issues, which are mentioned below.

But first things first: NFS.

i686pae kernel, lots of RAM, Atom-based (cedar trail)
machine with usual rtl8169 NIC. 3.0 or 3.2 kernel
(I will try current 3.4 but I don't have much hopes
there). NFSv4.

When a client machine (also running a 3.0 kernel) does some reading,
the process often stalls somewhere in the read syscall,
or, more rarely, during close, for up to two MINUTES. During
this time, the client (kernel) reports "NFS server <foo>
does not respond" several times, and finally "NFS server
<foo> ok", the client process "unstucks" from the read(2),
and is able to perform a few more reads till the whole
thing repeats.

While this happens, ie, while the client process is stuck,
any access to the mounted filesystem results in an
uninterruptible sleep - all processes including the
initial client are unkillable. This is very annoying,
but I was never able to fix this, including the real
case when the server becomes unreachable - "intr" mount
option which is supposed to help here makes no difference.
This is first and longstanding issue, but it is not really
related to the issue at hand.

On the server side things are more interesting. During
the client stall, all NFSD threads are busy running,
using all available CPU. The machine has 2 cores, and
if I reduce the number of NFSD threads to 1, one core will
be used by the single NFSD to 100%. With the default 8
NFSDs, both cores are used, by 4 NFSD threads each.
This is according to top(1), which shows proportional
and equal CPU activity for each NFSD thread.

But, and this is where things become more interesting,
perf top shows almost no activity with any number of NFSD
threads but one. For example, this is a typical perf top
display with 2 NFSD threads:

samples  pcnt function             DSO
1558.00  6.4% svc_recv             [sunrpc]
1084.00  5.2% __schedule           [kernel]
1065.00  5.1% select_task_rq_fair  [kernel]
 692.00  3.3% try_to_wake_up       [kernel]


Yes, there's some 6.4% time spent in svc_recv, but it is
FAR from the 200% CPU usage reported by top(1). Now, when
I reduce the number of NFSD threads to 1, perf top output becomes
dramatically different:

samples  pcnt function             DSO
5996.00 59.2% svc_recv             [sunrpc]
 750.00  7.4% cache_check          [sunrpc]
 519.00  5.1% svc_tcp_has_wspace   [sunrpc]
 465.00  4.6% local_bh_enable_ip   [kernel.kallsyms]
 301.00  3.0% svc_defer            [sunrpc]
 294.00  2.9% page_address         [kernel.kallsyms]

I.e., svc_recv time usage jumps from 6% to 60%, a tenfold
difference.

So, apparently, the `perf top' tool does not show the truth.

Trying to catch this high CPU usage with Sysrq+T does not
show anything interesting; the NFSD thread is running (and
is the only thread shown in the "Runnable tasks" output):


May 25 10:48:37 gnome vmunix: [ 1709.483771] nfsd R running 0 5853 2 0x00000000
May 25 10:48:37 gnome vmunix: [ 1709.483796] f4477030 00000046 00000000 5c8e8929 0000018e f4c700f0 f4043a70 f545b140
May 25 10:48:37 gnome vmunix: [ 1709.483841] 050a8544 0000018e c1469140 c1469140 f4043a70 00000086 00000000 c1048575
May 25 10:48:37 gnome vmunix: [ 1709.483887] c101ba11 c12fac18 f31bf800 f8f7bd30 0000003a c12fb156 0000003a 00000000
May 25 10:48:37 gnome vmunix: [ 1709.483931] Call Trace:
May 25 10:48:37 gnome vmunix: [ 1709.483950] [<c1048575>] ? irq_exit+0x35/0x90
May 25 10:48:37 gnome vmunix: [ 1709.483973] [<c101ba11>] ? smp_apic_timer_interrupt+0x51/0x90
May 25 10:48:37 gnome vmunix: [ 1709.483997] [<c12fac18>] ? _raw_spin_lock_bh+0x8/0x30
May 25 10:48:37 gnome vmunix: [ 1709.484029] [<f8f7bd30>] ? svc_xprt_enqueue+0x40/0x1e0 [sunrpc]
May 25 10:48:37 gnome vmunix: [ 1709.484053] [<c12fb156>] ? apic_timer_interrupt+0x2a/0x30
May 25 10:48:37 gnome vmunix: [ 1709.484084] [<f8f7007b>] ? svc_destroy+0xdb/0x110 [sunrpc]
May 25 10:48:37 gnome vmunix: [ 1709.484107] [<c103d3e3>] ? __cond_resched+0x13/0x30
May 25 10:48:37 gnome vmunix: [ 1709.484129] [<c12f92b5>] ? _cond_resched+0x25/0x30
May 25 10:48:37 gnome vmunix: [ 1709.484159] [<f8f7cb5d>] ? svc_recv+0x1cd/0x7e0 [sunrpc]
May 25 10:48:37 gnome vmunix: [ 1709.484182] [<c101ba11>] ? smp_apic_timer_interrupt+0x51/0x90
May 25 10:48:37 gnome vmunix: [ 1709.484206] [<c103a3d0>] ? try_to_wake_up+0x200/0x200
May 25 10:48:37 gnome vmunix: [ 1709.484233] [<f90ea08c>] ? nfsd+0x8c/0x120 [nfsd]
May 25 10:48:37 gnome vmunix: [ 1709.484255] [<c102d240>] ? complete+0x40/0x60
May 25 10:48:37 gnome vmunix: [ 1709.484279] [<f90ea000>] ? 0xf90e9fff
May 25 10:48:37 gnome vmunix: [ 1709.484299] [<c105ea99>] ? kthread+0x69/0x70
May 25 10:48:37 gnome vmunix: [ 1709.484321] [<c105ea30>] ? kthread_worker_fn+0x150/0x150
May 25 10:48:37 gnome vmunix: [ 1709.484345] [<c1301876>] ? kernel_thread_helper+0x6/0xd

May 25 10:49:37 gnome vmunix: [ 1769.982791] nfsd R running 0 5853 2 0x00000000
May 25 10:49:37 gnome vmunix: [ 1769.982816] f5449f80 00000000 00000000 7636b581 0000019c 00000003 7603dcb8 0000019c
May 25 10:49:37 gnome vmunix: [ 1769.982859] 7603dcb8 0000019c 00000046 00000000 f4043a70 00000086 00000000 c1048575
May 25 10:49:37 gnome vmunix: [ 1769.982903] f31bf800 c12fac18 f31bf800 f8f7bd30 00000047 f31bf800 f418c2c0 f8f7c03c
May 25 10:49:37 gnome vmunix: [ 1769.982948] Call Trace:
May 25 10:49:37 gnome vmunix: [ 1769.982966] [<c1048575>] ? irq_exit+0x35/0x90
May 25 10:49:37 gnome vmunix: [ 1769.982987] [<c12fac18>] ? _raw_spin_lock_bh+0x8/0x30
May 25 10:49:37 gnome vmunix: [ 1769.983019] [<f8f7bd30>] ? svc_xprt_enqueue+0x40/0x1e0 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983050] [<f8f7c03c>] ? svc_xprt_release+0x1c/0xd0 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983081] [<f8f7c0ad>] ? svc_xprt_release+0x8d/0xd0 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983112] [<f8f7cc03>] ? svc_recv+0x273/0x7e0 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983143] [<f8f7c300>] ? svc_age_temp_xprts+0x110/0x190 [sunrpc]
May 25 10:49:37 gnome vmunix: [ 1769.983166] [<c103a3d0>] ? try_to_wake_up+0x200/0x200
May 25 10:49:37 gnome vmunix: [ 1769.983193] [<f90ea08c>] ? nfsd+0x8c/0x120 [nfsd]
May 25 10:49:37 gnome vmunix: [ 1769.983215] [<c102d240>] ? complete+0x40/0x60
May 25 10:49:37 gnome vmunix: [ 1769.983238] [<f90ea000>] ? 0xf90e9fff
May 25 10:49:37 gnome vmunix: [ 1769.983258] [<c105ea99>] ? kthread+0x69/0x70
May 25 10:49:37 gnome vmunix: [ 1769.983278] [<c105ea30>] ? kthread_worker_fn+0x150/0x150
May 25 10:49:37 gnome vmunix: [ 1769.983301] [<c1301876>] ? kernel_thread_helper+0x6/0xd

This is 3.2; on 3.0 it was in a different place.

This problem is persistent, always reproducible, happens
on reads from the server (I haven't seen it with writes),
and completely goes away when reverting to the kernel I
previously used on this machine - 2.6.32. But unfortunately
2.6.32 does not quite work on the new hardware (I replaced
the motherboard), due to lack of drivers for several pieces.
(This is a different story, I think NFSD should work in 3.0
regardless of 2.6.32 issues :)

Anything I can do to make NFS work again?

Thanks,

/mjt


2012-05-31 12:59:33

by Myklebust, Trond

Subject: Re: 3.0+ NFS issues

On Thu, 2012-05-31 at 10:47 +0400, Michael Tokarev wrote:

> 10:38:53.827355 IP (tos 0x0, ttl 64, id 35160, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1396548098 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827379 IP (tos 0x0, ttl 64, id 35161, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1413325314 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827385 IP (tos 0x0, ttl 64, id 35162, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1430102530 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827400 IP (tos 0x0, ttl 64, id 35163, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1446879746 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827406 IP (tos 0x0, ttl 64, id 35164, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1463656962 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827409 IP (tos 0x0, ttl 64, id 35165, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1480434178 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827413 IP (tos 0x0, ttl 64, id 35166, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1497211394 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827417 IP (tos 0x0, ttl 64, id 35167, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1513988610 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827420 IP (tos 0x0, ttl 64, id 35168, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1530765826 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827424 IP (tos 0x0, ttl 64, id 35169, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1547543042 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827427 IP (tos 0x0, ttl 64, id 35170, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1564320258 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827434 IP (tos 0x0, ttl 64, id 35171, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1581097474 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827438 IP (tos 0x0, ttl 64, id 35172, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1597874690 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827443 IP (tos 0x0, ttl 64, id 35173, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1614651906 > 192.168.88.4.2049: 212 getattr fh 0,0/22
> 10:38:53.827447 IP (tos 0x0, ttl 64, id 35174, offset 0, flags [DF], proto TCP (6), length 268)
>     192.168.88.2.1631429122 > 192.168.88.4.2049: 212 getattr fh 0,0/22

That is tcpdump trying to interpret your NFSv4 trace as NFSv2/v3.

Can you either please use wireshark to provide a full text dump (using
something like 'tshark -V -O nfs,rpc'), or just send us the binary
tcpdump output using 'tcpdump -w /tmp/foo -s 90000'?

> >> Can at least the client be made interruptible?  Mounting with
> >> -o intr,soft makes no visible difference...
>
> please? :)

It already is interruptible: try 'kill -9' or any other fatal signal.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
www.netapp.com

2012-05-31 13:46:32

by Myklebust, Trond

Subject: Re: 3.0+ NFS issues

On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
> On 31.05.2012 16:59, Myklebust, Trond wrote:
> []
> > That is tcpdump trying to interpret your NFSv4 trace as NFSv2/v3.
>
> Oh.
>
> > Can you either please use wireshark to provide a full text dump (using
> > something like 'tshark -V -O nfs,rpc'), or just send us the binary
> > tcpdump output using 'tcpdump -w /tmp/foo -s 90000'?
>
> I started tcpdump:
>
>  tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
>
> on the client (192.168.88.2).  Next I mounted a directory on the client,
> and started reading (tar'ing) a directory into /dev/null.  It captured a
> few stalls.  Tcpdump shows number of packets it got, the stalls are at
> packet counts 58090, 97069 and 97071.  I cancelled the capture after that.
>
> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
> it is 220Mb uncompressed and 1.3Mb compressed.  The source files are
> 10 files of 1Gb each, all made by using `truncate' utility, so does not
> take place on disk at all.  This also makes it obvious that the issue
> does not depend on the speed of disk on the server (since in this case,
> the server disk isn't even in use).

OK. So from the above file it looks as if the traffic is mainly READ
requests.

In 2 places the server stops responding. In both cases, the client seems
to be sending a single TCP frame containing several COMPOUNDS containing
READ requests (which should be legal) just prior to the hang. When the
server doesn't respond, the client pings it with a RENEW, before it ends
up severing the TCP connection and then retransmitting.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
www.netapp.com

2012-05-30 07:11:45

by Michael Tokarev

Subject: Re: 3.0+ NFS issues

On 29.05.2012 19:24, J. Bruce Fields wrote:
> On Fri, May 25, 2012 at 10:53:11AM +0400, Michael Tokarev wrote:
>> I updated my nfs server machine to kernel 3.0, and
>> noticed that its main usage become, well, problematic.
>>
>> While trying to dig deeper, I also found a few other
>> interesting issues, which are mentioned below.
>>
>> But first thing first: nfs.
>>
>> i686pae kernel, lots of RAM, Atom-based (cedar trail)
>> machine with usual rtl8169 NIC. 3.0 or 3.2 kernel
>> (I will try current 3.4 but I don't have much hopes
>> there). NFSv4.
>>
>> When a client machine (also 3.0 kernel) does some reading,
>> the process often stalls somewhere in the read syscall,
>> or, rarer, during close, for up to two MINUTES. During
>> this time, the client (kernel) reports "NFS server <foo>
>> does not respond" several times, and finally "NFS server
>> <foo> ok", client process "unstucks" from the read(2),
>> and is able to perform a few more reads till the whole
>> thing repeats.
>
> You say 2.6.32 was OK; have you tried anything else between?

Well, I thought bisecting between 2.6.32 and 3.0 would be quite
painful... But I'll try it if nothing else helps. And no, I haven't
tried anything in between.

> And you're holding the client constant while varying only the server
> version, right?

Yes.

> Is your network otherwise working? (E.g. does transferring a bunch of
> data from server to client using some other protocol work reliably?)

Yes, it works flawlessly; all other protocols work so far.

To date, I have resorted to using a small webserver plus wget as an ugly
workaround for the problem - HTTP works for reads from the server, while
NFS works for writes.

> Is there anything happening on the network during these stalls? (You
> can watch the network with wireshark, for example.)

The network load is irrelevant - it behaves the same way with
100% idle network or with network busy doing other stuff.

> Does NFSv3 behave the same way?

Yes, it does, with all NFSDs on the server eating all available CPU for
quite some time while remaining "ghosts" to perf top.

And with the client being unkillable again.

Can at least the client be made interruptible? Mounting with
-o intr,soft makes no visible difference...

Thanks,

/mjt

2012-05-30 13:25:22

by J. Bruce Fields

Subject: Re: 3.0+ NFS issues

On Wed, May 30, 2012 at 11:11:42AM +0400, Michael Tokarev wrote:
> On 29.05.2012 19:24, J. Bruce Fields wrote:
> > On Fri, May 25, 2012 at 10:53:11AM +0400, Michael Tokarev wrote:
> >> I updated my nfs server machine to kernel 3.0, and
> >> noticed that its main usage become, well, problematic.
> >>
> >> While trying to dig deeper, I also found a few other
> >> interesting issues, which are mentioned below.
> >>
> >> But first thing first: nfs.
> >>
> >> i686pae kernel, lots of RAM, Atom-based (cedar trail)
> >> machine with usual rtl8169 NIC. 3.0 or 3.2 kernel
> >> (I will try current 3.4 but I don't have much hopes
> >> there). NFSv4.
> >>
> >> When a client machine (also 3.0 kernel) does some reading,
> >> the process often stalls somewhere in the read syscall,
> >> or, rarer, during close, for up to two MINUTES. During
> >> this time, the client (kernel) reports "NFS server <foo>
> >> does not respond" several times, and finally "NFS server
> >> <foo> ok", client process "unstucks" from the read(2),
> >> and is able to perform a few more reads till the whole
> >> thing repeats.
> >
> > You say 2.6.32 was OK; have you tried anything else between?
>
> Well, I thought bisecting between 2.6.32 and 3.0 will be quite
> painful... But I'll try if nothing else helps. And no, I haven't
> tried anything in-between.
>
> > And you're holding the client constant while varying only the server
> > version, right?
>
> Yes.
>
> > Is your network otherwise working? (E.g. does transferring a bunch of
> > data from server to client using some other protocol work reliably?)
>
> Yes, it works flawlessly, all other protocols works so far.
>
> To the date, I resorted to using a small webserver plus wget as an ugly
> workaround for the problem - http works for reads from the server, while
> nfs works for writes.
>
> > Is there anything happening on the network during these stalls? (You
> > can watch the network with wireshark, for example.)
>
> The network load is irrelevant - it behaves the same way with
> 100% idle network or with network busy doing other stuff.

That's not what I meant. During one of these read stalls, if you watch
the network with wireshark, do you see any NFS traffic between the
client and server?

Also: do you have a reliable way of reproducing this quickly?

--b.

> > Does NFSv3 behave the same way?
>
> Yes it does. With all NFSDs on server eating all available CPUs for
> quite some time, and with being "ghosts" for perf top.
>
> And with the client being unkillable again.
>
> Can at least the client be made interruptible? Mounting with
> -o intr,soft makes no visible difference...
>
> Thanks,
>
> /mjt

2012-05-31 13:51:21

by Michael Tokarev

Subject: Re: 3.0+ NFS issues

On 31.05.2012 17:46, Myklebust, Trond wrote:
> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
[]
>> I started tcpdump:
>>
>> tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
>>
>> on the client (192.168.88.2). Next I mounted a directory on the client,
>> and started reading (tar'ing) a directory into /dev/null. It captured a
>> few stalls. Tcpdump shows number of packets it got, the stalls are at
>> packet counts 58090, 97069 and 97071. I cancelled the capture after that.
>>
>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
>> it is 220Mb uncompressed and 1.3Mb compressed. The source files are
>> 10 files of 1Gb each, all made by using `truncate' utility, so does not
>> take place on disk at all. This also makes it obvious that the issue
>> does not depend on the speed of disk on the server (since in this case,
>> the server disk isn't even in use).
>
> OK. So from the above file it looks as if the traffic is mainly READ
> requests.

The issue here happens only with reads.

> In 2 places the server stops responding. In both cases, the client seems
> to be sending a single TCP frame containing several COMPOUNDS containing
> READ requests (which should be legal) just prior to the hang. When the
> server doesn't respond, the client pings it with a RENEW, before it ends
> up severing the TCP connection and then retransmitting.

And sometimes -- speaking only from the behaviour I've seen, not from the
actual frames sent -- the server does not respond to the RENEW either, in which
case the client reports "NFS server not responding", and on the next
renew it may actually respond. This happens too, but much more rarely.

During these stalls, i.e., when there's no network activity at all,
the server NFSD threads are busy eating all available CPU.

What does it all tell us? :)

Thank you!

/mjt

2012-05-31 06:47:12

by Michael Tokarev

[permalink] [raw]
Subject: Re: 3.0+ NFS issues

On 30.05.2012 17:25, J. Bruce Fields wrote:
> On Wed, May 30, 2012 at 11:11:42AM +0400, Michael Tokarev wrote:
[]
> That's not what I meant. During one of these read stalls, if you watch
> the network with wireshark, do you see any NFS traffic between the
> client and server?

Oh. Indeed, I misunderstood.

And no, during these stalls, there's no network activity at all.
Here's the typical scenario:

...
10:38:53.781990 IP (tos 0x0, ttl 64, id 35131, offset 0, flags [DF], proto TCP (6), length 52)
192.168.88.2.880 > 192.168.88.4.2049: Flags [.], cksum 0x317e (incorrect -> 0xb43d), ack 89530281, win 23173, options [nop,nop,TS val 3298129 ecr 122195208], length 0
10:38:53.782000 IP (tos 0x0, ttl 64, id 6329, offset 0, flags [DF], proto TCP (6), length 1500)
192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0xe827 (correct), seq 89530281:89531729, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 1448
10:38:53.782027 IP (tos 0x0, ttl 64, id 6330, offset 0, flags [DF], proto TCP (6), length 1708)
192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0x37f6 (incorrect -> 0x6790), seq 89531729:89533385, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 1656
10:38:53.782029 IP (tos 0x0, ttl 64, id 35132, offset 0, flags [DF], proto TCP (6), length 52)
192.168.88.2.880 > 192.168.88.4.2049: Flags [.], cksum 0x317e (incorrect -> 0xa81d), ack 89533385, win 23173, options [nop,nop,TS val 3298129 ecr 122195208], length 0
10:38:53.782040 IP (tos 0x0, ttl 64, id 6333, offset 0, flags [DF], proto TCP (6), length 1500)
192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0x0d5b (correct), seq 89534833:89536281, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 1448
10:38:53.782082 IP (tos 0x0, ttl 64, id 6334, offset 0, flags [DF], proto TCP (6), length 4396)
192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0x4276 (incorrect -> 0x778a), seq 89536281:89540625, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 4344
10:38:53.782088 IP (tos 0x0, ttl 64, id 35134, offset 0, flags [DF], proto TCP (6), length 52)
192.168.88.2.880 > 192.168.88.4.2049: Flags [.], cksum 0x317e (incorrect -> 0x8bd5), ack 89540625, win 23173, options [nop,nop,TS val 3298129 ecr 122195208], length 0
10:38:53.782096 IP (tos 0x0, ttl 64, id 6337, offset 0, flags [DF], proto TCP (6), length 1500)
192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0x835d (correct), seq 89540625:89542073, ack 40321, win 6289, options [nop,nop,TS val 122195208 ecr 3298129], length 1448
10:38:53.827355 IP (tos 0x0, ttl 64, id 35160, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1396548098 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827379 IP (tos 0x0, ttl 64, id 35161, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1413325314 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827385 IP (tos 0x0, ttl 64, id 35162, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1430102530 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827400 IP (tos 0x0, ttl 64, id 35163, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1446879746 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827406 IP (tos 0x0, ttl 64, id 35164, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1463656962 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827409 IP (tos 0x0, ttl 64, id 35165, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1480434178 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827413 IP (tos 0x0, ttl 64, id 35166, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1497211394 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827417 IP (tos 0x0, ttl 64, id 35167, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1513988610 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827420 IP (tos 0x0, ttl 64, id 35168, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1530765826 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827424 IP (tos 0x0, ttl 64, id 35169, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1547543042 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827427 IP (tos 0x0, ttl 64, id 35170, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1564320258 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827434 IP (tos 0x0, ttl 64, id 35171, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1581097474 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827438 IP (tos 0x0, ttl 64, id 35172, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1597874690 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827443 IP (tos 0x0, ttl 64, id 35173, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1614651906 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827447 IP (tos 0x0, ttl 64, id 35174, offset 0, flags [DF], proto TCP (6), length 268)
192.168.88.2.1631429122 > 192.168.88.4.2049: 212 getattr fh 0,0/22
10:38:53.827673 IP (tos 0x0, ttl 64, id 6428, offset 0, flags [DF], proto TCP (6), length 52)
192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0xe4e5 (correct), ack 41617, win 6289, options [nop,nop,TS val 122195221 ecr 3298142], length 0
10:38:53.827699 IP (tos 0x0, ttl 64, id 6429, offset 0, flags [DF], proto TCP (6), length 52)
192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0xdfd4 (correct), ack 42913, win 6289, options [nop,nop,TS val 122195221 ecr 3298143], length 0
10:38:53.865036 IP (tos 0x0, ttl 64, id 6430, offset 0, flags [DF], proto TCP (6), length 52)
192.168.88.4.2049 > 192.168.88.2.880: Flags [.], cksum 0xdd40 (correct), ack 43561, win 6289, options [nop,nop,TS val 122195233 ecr 3298143], length 0
[pause]
^C

192.168.88.2 is the client, .4 is the server.

I'm not sure if the series of getattr requests from the client comes
right before or right after the beginning of the stall, but after
the last 3 replies from the server there's no other activity for a
long time, and the server is eating 100% of the available CPU as I described
previously.

> Also: do you have a reliable way of reproducing this quickly?

Yes, it is enough to start copying any large file and in a few
seconds the first stall happens.

Can you suggest something for the other part of the question:

>> Can at least the client be made interruptible? Mounting with
>> -o intr,soft makes no visible difference...

please? :)

Thank you!

/mjt

2012-05-29 15:24:18

by J. Bruce Fields

Subject: Re: 3.0+ NFS issues

On Fri, May 25, 2012 at 10:53:11AM +0400, Michael Tokarev wrote:
> I updated my nfs server machine to kernel 3.0, and
> noticed that its main usage become, well, problematic.
>
> While trying to dig deeper, I also found a few other
> interesting issues, which are mentioned below.
>
> But first thing first: nfs.
>
> i686pae kernel, lots of RAM, Atom-based (cedar trail)
> machine with usual rtl8169 NIC. 3.0 or 3.2 kernel
> (I will try current 3.4 but I don't have much hopes
> there). NFSv4.
>
> When a client machine (also 3.0 kernel) does some reading,
> the process often stalls somewhere in the read syscall,
> or, rarer, during close, for up to two MINUTES. During
> this time, the client (kernel) reports "NFS server <foo>
> does not respond" several times, and finally "NFS server
> <foo> ok", client process "unstucks" from the read(2),
> and is able to perform a few more reads till the whole
> thing repeats.

You say 2.6.32 was OK; have you tried anything else between?

And you're holding the client constant while varying only the server
version, right?

Is your network otherwise working? (E.g. does transferring a bunch of
data from server to client using some other protocol work reliably?)

Is there anything happening on the network during these stalls? (You
can watch the network with wireshark, for example.)

Does NFSv3 behave the same way?

--b.

>
> While this happens, ie, while the client process is stuck,
> any access to the mounted filesystem results in an
> uninterruptible sleep - all processes including the
> initial client are unkillable. This is very annoying,
> but I was never able to fix this, including the real
> case when the server becomes unreachable - "intr" mount
> option which is supposed to help here makes no difference.
> This is first and longstanding issue, but it is not really
> related to the issue at hand.
>
> On the server side things are more interesting. During
> the client stall, all NFSD threads are busy running,
> using all available CPU. The machine has 2 cores, and
> if I reduce number of NFSD threads to 1, one core will
> be used by the single NFSD to 100%. With default 8
> NFSDs, both cores are used, by 4 NFSD threads each.
> This is according to top(1), which show proportional
> and equal CPU activity for each of NFSD thread.
>
> But. And this is where things becomes more interesting.
> perf top shows almost no activity with any number of NFSD
> threads but one. For example, this is a typical perf top
> display with 2 NFSD threads:
>
> samples pcnt function DSO
> 1558.00 6.4% svc_recv [sunrpc]
> 1084.00 5.2% __schedule [kernel]
> 1065.00 5.1% select_task_rq_fair [kernel]
> 692.00 3.3% try_to_wake_up [kernel]
>
>
> Yes, there's some 6.4% time spent in svc_recv, but it is
> FAR from the 200% CPU usage reported by top(1). Now, when
> I reduce number of NFSD threds to 1, perf top output becomes
> dramatically different:
>
> samples pcnt function DSO
> 5996.00 59.2% svc_recv [sunrpc]
> 750.00 7.4% cache_check [sunrpc]
> 519.00 5.1% svc_tcp_has_wspace [sunrpc]
> 465.00 4.6% local_bh_enable_ip [kernel.kallsyms]
> 301.00 3.0% svc_defer [sunrpc]
> 294.00 2.9% page_address [kernel.kallsyms]
>
> Ie, scv_recv time usage jumps from 6 to 60%, ie, 10 times
> the difference.
>
> So, apparently, `perf top' tool does not show the truth.
>
> Trying to catch this high CPU usage with Sysrq+T does not
> show anything interesting, the NFSD thread is running (and
> is the only thread(s) shown in the "Runnable tasks" output):
>
>
> May 25 10:48:37 gnome vmunix: [ 1709.483771] nfsd R running 0 5853 2 0x00000000
> May 25 10:48:37 gnome vmunix: [ 1709.483796] f4477030 00000046 00000000 5c8e8929 0000018e f4c700f0 f4043a70 f545b140
> May 25 10:48:37 gnome vmunix: [ 1709.483841] 050a8544 0000018e c1469140 c1469140 f4043a70 00000086 00000000 c1048575
> May 25 10:48:37 gnome vmunix: [ 1709.483887] c101ba11 c12fac18 f31bf800 f8f7bd30 0000003a c12fb156 0000003a 00000000
> May 25 10:48:37 gnome vmunix: [ 1709.483931] Call Trace:
> May 25 10:48:37 gnome vmunix: [ 1709.483950] [<c1048575>] ? irq_exit+0x35/0x90
> May 25 10:48:37 gnome vmunix: [ 1709.483973] [<c101ba11>] ? smp_apic_timer_interrupt+0x51/0x90
> May 25 10:48:37 gnome vmunix: [ 1709.483997] [<c12fac18>] ? _raw_spin_lock_bh+0x8/0x30
> May 25 10:48:37 gnome vmunix: [ 1709.484029] [<f8f7bd30>] ? svc_xprt_enqueue+0x40/0x1e0 [sunrpc]
> May 25 10:48:37 gnome vmunix: [ 1709.484053] [<c12fb156>] ? apic_timer_interrupt+0x2a/0x30
> May 25 10:48:37 gnome vmunix: [ 1709.484084] [<f8f7007b>] ? svc_destroy+0xdb/0x110 [sunrpc]
> May 25 10:48:37 gnome vmunix: [ 1709.484107] [<c103d3e3>] ? __cond_resched+0x13/0x30
> May 25 10:48:37 gnome vmunix: [ 1709.484129] [<c12f92b5>] ? _cond_resched+0x25/0x30
> May 25 10:48:37 gnome vmunix: [ 1709.484159] [<f8f7cb5d>] ? svc_recv+0x1cd/0x7e0 [sunrpc]
> May 25 10:48:37 gnome vmunix: [ 1709.484182] [<c101ba11>] ? smp_apic_timer_interrupt+0x51/0x90
> May 25 10:48:37 gnome vmunix: [ 1709.484206] [<c103a3d0>] ? try_to_wake_up+0x200/0x200
> May 25 10:48:37 gnome vmunix: [ 1709.484233] [<f90ea08c>] ? nfsd+0x8c/0x120 [nfsd]
> May 25 10:48:37 gnome vmunix: [ 1709.484255] [<c102d240>] ? complete+0x40/0x60
> May 25 10:48:37 gnome vmunix: [ 1709.484279] [<f90ea000>] ? 0xf90e9fff
> May 25 10:48:37 gnome vmunix: [ 1709.484299] [<c105ea99>] ? kthread+0x69/0x70
> May 25 10:48:37 gnome vmunix: [ 1709.484321] [<c105ea30>] ? kthread_worker_fn+0x150/0x150
> May 25 10:48:37 gnome vmunix: [ 1709.484345] [<c1301876>] ? kernel_thread_helper+0x6/0xd
>
> May 25 10:49:37 gnome vmunix: [ 1769.982791] nfsd R running 0 5853 2 0x00000000
> May 25 10:49:37 gnome vmunix: [ 1769.982816] f5449f80 00000000 00000000 7636b581 0000019c 00000003 7603dcb8 0000019c
> May 25 10:49:37 gnome vmunix: [ 1769.982859] 7603dcb8 0000019c 00000046 00000000 f4043a70 00000086 00000000 c1048575
> May 25 10:49:37 gnome vmunix: [ 1769.982903] f31bf800 c12fac18 f31bf800 f8f7bd30 00000047 f31bf800 f418c2c0 f8f7c03c
> May 25 10:49:37 gnome vmunix: [ 1769.982948] Call Trace:
> May 25 10:49:37 gnome vmunix: [ 1769.982966] [<c1048575>] ? irq_exit+0x35/0x90
> May 25 10:49:37 gnome vmunix: [ 1769.982987] [<c12fac18>] ? _raw_spin_lock_bh+0x8/0x30
> May 25 10:49:37 gnome vmunix: [ 1769.983019] [<f8f7bd30>] ? svc_xprt_enqueue+0x40/0x1e0 [sunrpc]
> May 25 10:49:37 gnome vmunix: [ 1769.983050] [<f8f7c03c>] ? svc_xprt_release+0x1c/0xd0 [sunrpc]
> May 25 10:49:37 gnome vmunix: [ 1769.983081] [<f8f7c0ad>] ? svc_xprt_release+0x8d/0xd0 [sunrpc]
> May 25 10:49:37 gnome vmunix: [ 1769.983112] [<f8f7cc03>] ? svc_recv+0x273/0x7e0 [sunrpc]
> May 25 10:49:37 gnome vmunix: [ 1769.983143] [<f8f7c300>] ? svc_age_temp_xprts+0x110/0x190 [sunrpc]
> May 25 10:49:37 gnome vmunix: [ 1769.983166] [<c103a3d0>] ? try_to_wake_up+0x200/0x200
> May 25 10:49:37 gnome vmunix: [ 1769.983193] [<f90ea08c>] ? nfsd+0x8c/0x120 [nfsd]
> May 25 10:49:37 gnome vmunix: [ 1769.983215] [<c102d240>] ? complete+0x40/0x60
> May 25 10:49:37 gnome vmunix: [ 1769.983238] [<f90ea000>] ? 0xf90e9fff
> May 25 10:49:37 gnome vmunix: [ 1769.983258] [<c105ea99>] ? kthread+0x69/0x70
> May 25 10:49:37 gnome vmunix: [ 1769.983278] [<c105ea30>] ? kthread_worker_fn+0x150/0x150
> May 25 10:49:37 gnome vmunix: [ 1769.983301] [<c1301876>] ? kernel_thread_helper+0x6/0xd
>
> This is 3.2, on 3.0 it was in different place.
>
> This problem is persistent, always reproducible, happens
> on reads from the server (I haven't seen it with writes),
> and completely goes away when reverting to the kernel I
> previously used on this machine - 2.6.32. But unfortunately
> 2.6.32 does not quite work on the new hardware (I replaced
> the motherboard), due to lack of drivers for several pieces.
> (This is a different story, I think NFSD should work in 3.0
> regardless of 2.6.32 issues :)
>
> Anything I can do to make NFS work again?
>
> Thanks,
>
> /mjt
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-05-31 13:25:01

by Michael Tokarev

Subject: Re: 3.0+ NFS issues

On 31.05.2012 16:59, Myklebust, Trond wrote:
[]
> That is tcpdump trying to interpret your NFSv4 trace as NFSv2/v3.

Oh.

> Can you either please use wireshark to provide a full text dump (using
> something like 'tshark -V -O nfs,rpc'), or just send us the binary
> tcpdump output using 'tcpdump -w /tmp/foo -s 90000'?

I started tcpdump:

tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump

on the client (192.168.88.2). Next I mounted a directory on the client,
and started reading (tar'ing) a directory into /dev/null. It captured a
few stalls. Tcpdump shows the number of packets it got; the stalls are at
packet counts 58090, 97069 and 97071. I cancelled the capture after that.

The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
it is 220Mb uncompressed and 1.3Mb compressed. The source files are
10 files of 1Gb each, all made using the `truncate' utility, so they do not
take up space on disk at all. This also makes it obvious that the issue
does not depend on the speed of the disk on the server (since in this case,
the server disk isn't even in use).

What I noticed is: with default 8 NFSD threads, it takes quite a bit
longer to hit this issue, but it still happens. The capture was
done using 2 threads.

>>>> Can at least the client be made interruptible? Mounting with
>>>> -o intr,soft makes no visible difference...
>>
>> please? :)
>
> It already is interruptible: try 'kill -9' or any other fatal signal.

Oh. Indeed, I can kill it from another terminal using -9. That is
great already!

Thank you!

/mjt

2012-06-20 12:55:43

by Christoph Bartoschek

Subject: Re: 3.0+ NFS issues

>> In 2 places the server stops responding. In both cases, the client seems
>> to be sending a single TCP frame containing several COMPOUNDS containing
>> READ requests (which should be legal) just prior to the hang. When the
>> server doesn't respond, the client pings it with a RENEW, before it ends
>> up severing the TCP connection and then retransmitting.
>
> And sometimes -- speaking only from the behavour I've seen, not from the
> actual frames sent -- server does not respond to the RENEW too, in which
> case the client reports "nfs server no responding", and on the next
> renew it may actually respond. This happens too, but much more rare.
>
> During these stalls, ie, when there's no network activity at all,
> the server NFSD threads are busy eating all available CPU.
>
> What does it all tell us? :)

Hi,

is there any news on this issue? We see bad NFS performance on systems running
Ubuntu 12.04 compared to similar systems running Ubuntu 10.04.

There are also two bug reports for Ubuntu with some information:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1006446
https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/879334

Christoph


2012-07-10 12:52:07

by Michael Tokarev

Subject: Re: 3.0+ NFS issues

I tried to debug this again, hoping to reproduce it in a virtual machine,
and found out that only the 32-bit server code shows this issue:
after updating the kernel on the server to 64-bit (the same version)
I can't reproduce the issue anymore. Rebooting back to 32-bit,
and voila, it is here again.

Something apparently isn't right on 32 bits... ;)

(And yes, the problem is still present and is very annoying :)

Thanks,

/mjt


On 31.05.2012 17:51, Michael Tokarev wrote:
> On 31.05.2012 17:46, Myklebust, Trond wrote:
>> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
> []
>>> I started tcpdump:
>>>
>>> tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
>>>
>>> on the client (192.168.88.2). Next I mounted a directory on the client,
>>> and started reading (tar'ing) a directory into /dev/null. It captured a
>>> few stalls. Tcpdump shows number of packets it got, the stalls are at
>>> packet counts 58090, 97069 and 97071. I cancelled the capture after that.
>>>
>>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
>>> it is 220Mb uncompressed and 1.3Mb compressed. The source files are
>>> 10 files of 1Gb each, all made by using `truncate' utility, so does not
>>> take place on disk at all. This also makes it obvious that the issue
>>> does not depend on the speed of disk on the server (since in this case,
>>> the server disk isn't even in use).
>>
>> OK. So from the above file it looks as if the traffic is mainly READ
>> requests.
>
> The issue here happens only with reads.
>
>> In 2 places the server stops responding. In both cases, the client seems
>> to be sending a single TCP frame containing several COMPOUNDS containing
>> READ requests (which should be legal) just prior to the hang. When the
>> server doesn't respond, the client pings it with a RENEW, before it ends
>> up severing the TCP connection and then retransmitting.
>
> And sometimes -- speaking only from the behavour I've seen, not from the
> actual frames sent -- server does not respond to the RENEW too, in which
> case the client reports "nfs server no responding", and on the next
> renew it may actually respond. This happens too, but much more rare.
>
> During these stalls, ie, when there's no network activity at all,
> the server NFSD threads are busy eating all available CPU.
>
> What does it all tell us? :)
>
> Thank you!
>
> /mjt
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/


2012-07-12 12:53:07

by J. Bruce Fields

Subject: Re: 3.0+ NFS issues

On Tue, Jul 10, 2012 at 04:52:03PM +0400, Michael Tokarev wrote:
> I tried to debug this again, maybe to reproduce in a virtual machine,
> and found out that it is only 32bit server code shows this issue:
> after updating the kernel on the server to 64bit (the same version)
> I can't reproduce this issue anymore. Rebooting back to 32bit,
> and voila, it is here again.
>
> Something apparenlty isn't right on 32bits... ;)
>
> (And yes, the prob is still present and is very annoying :)

OK, that's very useful, thanks. So probably a bug got introduced in the
32-bit case between 2.6.32 and 3.0.

My personal upstream testing is normally all x86_64 only. I'll kick off
a 32-bit install and see if I can reproduce this quickly.

Let me know if you're able to narrow this down any more.

--b.

>
> Thanks,
>
> /mjt
>
>
> On 31.05.2012 17:51, Michael Tokarev wrote:
> > On 31.05.2012 17:46, Myklebust, Trond wrote:
> >> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
> > []
> >>> I started tcpdump:
> >>>
> >>> tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
> >>>
> >>> on the client (192.168.88.2). Next I mounted a directory on the client,
> >>> and started reading (tar'ing) a directory into /dev/null. It captured a
> >>> few stalls. Tcpdump shows number of packets it got, the stalls are at
> >>> packet counts 58090, 97069 and 97071. I cancelled the capture after that.
> >>>
> >>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
> >>> it is 220Mb uncompressed and 1.3Mb compressed. The source files are
> >>> 10 files of 1Gb each, all made by using `truncate' utility, so does not
> >>> take place on disk at all. This also makes it obvious that the issue
> >>> does not depend on the speed of disk on the server (since in this case,
> >>> the server disk isn't even in use).
> >>
> >> OK. So from the above file it looks as if the traffic is mainly READ
> >> requests.
> >
> > The issue here happens only with reads.
> >
> >> In 2 places the server stops responding. In both cases, the client seems
> >> to be sending a single TCP frame containing several COMPOUNDS containing
> >> READ requests (which should be legal) just prior to the hang. When the
> >> server doesn't respond, the client pings it with a RENEW, before it ends
> >> up severing the TCP connection and then retransmitting.
> >
> > And sometimes -- speaking only from the behavour I've seen, not from the
> > actual frames sent -- server does not respond to the RENEW too, in which
> > case the client reports "nfs server no responding", and on the next
> > renew it may actually respond. This happens too, but much more rare.
> >
> > During these stalls, ie, when there's no network activity at all,
> > the server NFSD threads are busy eating all available CPU.
> >
> > What does it all tell us? :)
> >
> > Thank you!
> >
> > /mjt
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
>

2012-08-17 22:32:56

by J. Bruce Fields

Subject: Re: 3.0+ NFS issues (bisected)

On Fri, Aug 17, 2012 at 04:08:07PM -0400, J. Bruce Fields wrote:
> Wait a minute, that assumption's a problem because that calculation
> depends in part on xpt_reserved, which is changed here....
>
> In particular, svc_xprt_release() calls svc_reserve(rqstp, 0), which
> subtracts rqstp->rq_reserved and then calls svc_xprt_enqueue, now with a
> lower xpt_reserved value. That could well explain this.

So, maybe something like this?

--b.

commit c8136c319ad85d0db870021fc3f9074d37f26d4a
Author: J. Bruce Fields <[email protected]>
Date: Fri Aug 17 17:31:53 2012 -0400

svcrpc: don't add to xpt_reserved till we receive

The rpc server tries to ensure that there will be room to send a reply
before it receives a request.

It does this by tracking, in xpt_reserved, an upper bound on the total
size of the replies that it has already committed to for the socket.

Currently it is adding in the estimate for a new reply *before* it
checks whether there is space available. If it finds that there is not
space, it then subtracts the estimate back out.

This may lead the subsequent svc_xprt_enqueue to decide that there is
space after all.

The result is a svc_recv() that will repeatedly return -EAGAIN, causing
server threads to loop without doing any actual work.

Reported-by: Michael Tokarev <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index ec99849a..59ff3a3 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -366,8 +366,6 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
rqstp, rqstp->rq_xprt);
rqstp->rq_xprt = xprt;
svc_xprt_get(xprt);
- rqstp->rq_reserved = serv->sv_max_mesg;
- atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
pool->sp_stats.threads_woken++;
wake_up(&rqstp->rq_wait);
} else {
@@ -644,8 +642,6 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
if (xprt) {
rqstp->rq_xprt = xprt;
svc_xprt_get(xprt);
- rqstp->rq_reserved = serv->sv_max_mesg;
- atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);

/* As there is a shortage of threads and this request
* had to be queued, don't allow the thread to wait so
@@ -743,6 +739,10 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
len = xprt->xpt_ops->xpo_recvfrom(rqstp);
dprintk("svc: got len=%d\n", len);
}
+ if (len > 0) {
+ rqstp->rq_reserved = serv->sv_max_mesg;
+ atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
+ }
svc_xprt_received(xprt);

/* No data, incomplete (TCP) read, or accept() */
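
Below is a toy user-space model of the accounting pattern described in the
commit message above -- a sketch for illustration only, with made-up names
and numbers, not kernel code. It just shows how charging the worst-case
reply size before the space check, and uncharging it on the release path,
lets the enqueue-time check keep passing while the receive-time check keeps
failing:

/* toy_xpt_reserved.c: build with e.g. `gcc -o toy toy_xpt_reserved.c' */
#include <stdio.h>

#define SOCK_WSPACE 600   /* pretend free space in the socket send buffer */
#define MAX_MESG    512   /* worst-case reply size reserved per request   */

static int xpt_reserved;  /* space already promised to in-flight replies  */

/* crude stand-in for xpo_has_wspace(): is there room for one more
 * worst-case reply on top of what is already reserved? */
static int has_wspace(void)
{
        return SOCK_WSPACE - xpt_reserved >= MAX_MESG;
}

int main(void)
{
        /* enqueue time: nothing reserved yet, the transport looks usable */
        printf("enqueue check:    %s\n", has_wspace() ? "space" : "no space");

        /* svc_recv(): the estimate is charged first, then space is checked */
        xpt_reserved += MAX_MESG;
        printf("recv-time check:  %s\n", has_wspace() ? "space" : "no space");

        /* no space, so nothing is received; the release path undoes the
         * charge and re-enqueues, and the next enqueue check passes again:
         * the svc_recv() -EAGAIN busy loop reported in this thread */
        xpt_reserved -= MAX_MESG;
        printf("re-enqueue check: %s\n", has_wspace() ? "space" : "no space");
        return 0;
}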

2012-08-17 17:26:29

by Michael Tokarev

Subject: Re: 3.0+ NFS issues (bisected)

On 17.08.2012 21:18, J. Bruce Fields wrote:
> On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote:
[]
>> So we're calling svc_recv in a tight loop, eating
>> all available CPU. (The above is with just 2 nfsd
>> threads).
>>
>> Something is definitely wrong here. And it happens mure more
>> often after the mentioned commit (f03d78db65085).
>
Oh, neat. Hm. That commit doesn't really sound like the cause, then.
Is that busy-looping reproducible on kernels before that commit?

Note I bisected this issue to this commit. I haven't seen it
happening before this commit, and reverting it from the 3.0 or 3.2
kernel makes the problem go away.

I guess it is looping there:


net/sunrpc/svc_xprt.c:svc_recv()
...
        len = 0;
...
        if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
                ...
        } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {   <=== here -- has no wspace due to memory...
                ... len = <something>
        }

        /* No data, incomplete (TCP) read, or accept() */
        if (len == 0 || len == -EAGAIN)
                goto out;
...
out:
        rqstp->rq_res.len = 0;
        svc_xprt_release(rqstp);
        return -EAGAIN;
}

I'm trying to verify this theory...

/mjt

2012-08-17 17:12:41

by Michael Tokarev

Subject: Re: 3.0+ NFS issues (bisected)

On 17.08.2012 20:00, J. Bruce Fields wrote:
[]> Uh, if I grepped my way through this right: it looks like it's the
> "memory" column of the "TCP" row of /proc/net/protocols; might be
> interesting to see how that's changing over time.

This file does not look interesting. Memory usage does not jump,
and there's no large increase either.

But there's something else which is interesting here.

I noticed that in perf top, the top consumer of CPU is svc_recv()
(I mentioned this at the start of this thread). So I looked at how
this routine is called from nfsd. And here we go.

fs/nfsd/nfssvc.c:

/*
 * This is the NFS server kernel thread
 */
static int
nfsd(void *vrqstp)
{
...
        /*
         * The main request loop
         */
        for (;;) {
                /*
                 * Find a socket with data available and call its
                 * recvfrom routine.
                 */
                int i = 0;
                while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
                        ++i;
                printk(KERN_ERR "calling svc_recv: %d times (err=%d)\n", i, err);
                if (err == -EINTR)
                        break;
...

(I added the "i" counter and the printk). And here's the output:

[19626.401136] calling svc_recv: 0 times (err=212)
[19626.405059] calling svc_recv: 1478 times (err=212)
[19626.409512] calling svc_recv: 1106 times (err=212)
[19626.543020] calling svc_recv: 0 times (err=212)
[19626.543059] calling svc_recv: 0 times (err=212)
[19626.548074] calling svc_recv: 0 times (err=212)
[19626.549515] calling svc_recv: 0 times (err=212)
[19626.552320] calling svc_recv: 0 times (err=212)
[19626.553503] calling svc_recv: 0 times (err=212)
[19626.556007] calling svc_recv: 0 times (err=212)
[19626.557152] calling svc_recv: 0 times (err=212)
[19626.560109] calling svc_recv: 0 times (err=212)
[19626.560943] calling svc_recv: 0 times (err=212)
[19626.565315] calling svc_recv: 1067 times (err=212)
[19626.569735] calling svc_recv: 2571 times (err=212)
[19626.574150] calling svc_recv: 3842 times (err=212)
[19626.581914] calling svc_recv: 2891 times (err=212)
[19626.583072] calling svc_recv: 1247 times (err=212)
[19626.616885] calling svc_recv: 0 times (err=212)
[19626.616952] calling svc_recv: 0 times (err=212)
[19626.622889] calling svc_recv: 0 times (err=212)
[19626.624518] calling svc_recv: 0 times (err=212)
[19626.627118] calling svc_recv: 0 times (err=212)
[19626.629735] calling svc_recv: 0 times (err=212)
[19626.631777] calling svc_recv: 0 times (err=212)
[19626.633986] calling svc_recv: 0 times (err=212)
[19626.636746] calling svc_recv: 0 times (err=212)
[19626.637692] calling svc_recv: 0 times (err=212)
[19626.640769] calling svc_recv: 0 times (err=212)
[19626.657852] calling svc_recv: 0 times (err=212)
[19626.661602] calling svc_recv: 0 times (err=212)
[19626.670160] calling svc_recv: 0 times (err=212)
[19626.671917] calling svc_recv: 0 times (err=212)
[19626.684643] calling svc_recv: 0 times (err=212)
[19626.684680] calling svc_recv: 0 times (err=212)
[19626.812820] calling svc_recv: 0 times (err=212)
[19626.814697] calling svc_recv: 0 times (err=212)
[19626.817195] calling svc_recv: 0 times (err=212)
[19626.820324] calling svc_recv: 0 times (err=212)
[19626.822855] calling svc_recv: 0 times (err=212)
[19626.824823] calling svc_recv: 0 times (err=212)
[19626.828016] calling svc_recv: 0 times (err=212)
[19626.829021] calling svc_recv: 0 times (err=212)
[19626.831970] calling svc_recv: 0 times (err=212)

> the stall begins:
[19686.823135] calling svc_recv: 3670352 times (err=212)
[19686.823524] calling svc_recv: 3659205 times (err=212)

> transfer continues
[19686.854734] calling svc_recv: 0 times (err=212)
[19686.860023] calling svc_recv: 0 times (err=212)
[19686.887124] calling svc_recv: 0 times (err=212)
[19686.895532] calling svc_recv: 0 times (err=212)
[19686.903667] calling svc_recv: 0 times (err=212)
[19686.922780] calling svc_recv: 0 times (err=212)

So we're calling svc_recv in a tight loop, eating
all available CPU. (The above is with just 2 nfsd
threads.)

Something is definitely wrong here. And it happens much more
often after the mentioned commit (f03d78db65085).

Thanks,

/mjt

2012-08-17 16:01:01

by J. Bruce Fields

Subject: Re: 3.0+ NFS issues (bisected)

On Fri, Aug 17, 2012 at 10:56:16AM -0400, J. Bruce Fields wrote:
> On Fri, Aug 17, 2012 at 05:56:56AM +0400, Michael Tokarev wrote:
> > On 12.07.2012 16:53, J. Bruce Fields wrote:
> > > On Tue, Jul 10, 2012 at 04:52:03PM +0400, Michael Tokarev wrote:
> > >> I tried to debug this again, maybe to reproduce in a virtual machine,
> > >> and found out that it is only 32bit server code shows this issue:
> > >> after updating the kernel on the server to 64bit (the same version)
> > >> I can't reproduce this issue anymore. Rebooting back to 32bit,
> > >> and voila, it is here again.
> > >>
> > >> Something apparenlty isn't right on 32bits... ;)
> > >>
> > >> (And yes, the prob is still present and is very annoying :)
> > >
> > > OK, that's very useful, thanks. So probably a bug got introduced in the
> > > 32-bit case between 2.6.32 and 3.0.
> > >
> > > My personal upstream testing is normally all x86_64 only. I'll kick off
> > > a 32-bit install and see if I can reproduce this quickly.
> >
> > Actually it has nothing to do with 32 vs 64 bits as I
> > initially thought. It happens on 64bits too, but takes
> > more time (or data to transfer) to trigger.
>
> That makes it sound like some kind of leak: you're hitting this case
> eventually either way, but it takes longer in the case where you have
> more (low) memory.
>
> I wish I was more familiar with the tcp code.... What number exactly is
> being compared against those limits, and how could we watch it from
> userspace?

Uh, if I grepped my way through this right: it looks like it's the
"memory" column of the "TCP" row of /proc/net/protocols; might be
interesting to see how that's changing over time.
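
In case it helps, here's a minimal user-space watcher along those lines --
a sketch only, assuming the "memory" value is the 4th column of the TCP row
in /proc/net/protocols and that the tcp_mem limits live in
/proc/sys/net/ipv4/tcp_mem (both counted in pages); adjust if your /proc
layout differs:

/* watch_tcp_mem.c: print TCP's current memory usage vs. the tcp_mem limits */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* returns the "memory" field of the TCP row, or -1 on error */
static long tcp_pages_in_use(void)
{
        char line[512], name[64];
        long size, sockets, memory;
        FILE *f = fopen("/proc/net/protocols", "r");

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "%63s %ld %ld %ld",
                           name, &size, &sockets, &memory) == 4 &&
                    strcmp(name, "TCP") == 0) {
                        fclose(f);
                        return memory;
                }
        }
        fclose(f);
        return -1;
}

int main(void)
{
        long min = 0, pressure = 0, max = 0;
        FILE *f = fopen("/proc/sys/net/ipv4/tcp_mem", "r");

        if (f) {
                if (fscanf(f, "%ld %ld %ld", &min, &pressure, &max) != 3)
                        min = pressure = max = 0;
                fclose(f);
        }
        printf("tcp_mem (pages): min=%ld pressure=%ld max=%ld\n",
               min, pressure, max);
        for (;;) {
                printf("TCP memory in use: %ld pages\n", tcp_pages_in_use());
                sleep(1);
        }
        return 0;
}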

>
> --b.
>
> >
> >
> > > Let me know if you're able to narrow this down any more.
> >
> > I bisected this issue to the following commit:
> >
> > commit f03d78db65085609938fdb686238867e65003181
> > Author: Eric Dumazet <[email protected]>
> > Date: Thu Jul 7 00:27:05 2011 -0700
> >
> > net: refine {udp|tcp|sctp}_mem limits
> >
> > Current tcp/udp/sctp global memory limits are not taking into account
> > hugepages allocations, and allow 50% of ram to be used by buffers of a
> > single protocol [ not counting space used by sockets / inodes ...]
> >
> > Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram
> > per protocol, and a minimum of 128 pages.
> > Heavy duty machines sysadmins probably need to tweak limits anyway.
> >
> >
> > Reverting this commit on top of 3.0 (or any later 3.x kernel) fixes
> > the behavour here.
> >
> > This machine has 4Gb of memory. On 3.0, with this patch applied
> > (as it is part of 3.0), tcp_mem is like this:
> >
> > 21228 28306 42456
> >
> > with this patch reverted, tcp_mem shows:
> >
> > 81216 108288 162432
> >
> > and with these values, it works fine.
> >
> > So it looks like something else goes wrong there,
> > which lead to all nfsds fighting with each other
> > for something and eating 100% of available CPU
> > instead of servicing clients.
> >
> > For added fun, when setting tcp_mem to the "good" value
> > from "bad" value (after booting into kernel with that
> > patch applied), the problem is _not_ fixed.
> >
> > Any further hints?
> >
> > Thanks,
> >
> > /mjt
> >
> > >> On 31.05.2012 17:51, Michael Tokarev wrote:
> > >>> On 31.05.2012 17:46, Myklebust, Trond wrote:
> > >>>> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
> > >>> []
> > >>>>> I started tcpdump:
> > >>>>>
> > >>>>> tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
> > >>>>>
> > >>>>> on the client (192.168.88.2). Next I mounted a directory on the client,
> > >>>>> and started reading (tar'ing) a directory into /dev/null. It captured a
> > >>>>> few stalls. Tcpdump shows number of packets it got, the stalls are at
> > >>>>> packet counts 58090, 97069 and 97071. I cancelled the capture after that.
> > >>>>>
> > >>>>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
> > >>>>> it is 220Mb uncompressed and 1.3Mb compressed. The source files are
> > >>>>> 10 files of 1Gb each, all made using the `truncate' utility, so they take
> > >>>>> up no space on disk at all. This also makes it obvious that the issue
> > >>>>> does not depend on the speed of disk on the server (since in this case,
> > >>>>> the server disk isn't even in use).
> > >>>>
> > >>>> OK. So from the above file it looks as if the traffic is mainly READ
> > >>>> requests.
> > >>>
> > >>> The issue here happens only with reads.
> > >>>
> > >>>> In 2 places the server stops responding. In both cases, the client seems
> > >>>> to be sending a single TCP frame containing several COMPOUNDS containing
> > >>>> READ requests (which should be legal) just prior to the hang. When the
> > >>>> server doesn't respond, the client pings it with a RENEW, before it ends
> > >>>> up severing the TCP connection and then retransmitting.
> > >>>
> > >>> And sometimes -- speaking only from the behaviour I've seen, not from the
> > >>> actual frames sent -- the server does not respond to the RENEW either, in
> > >>> which case the client reports "nfs server not responding", and on the next
> > >>> renew it may actually respond. This happens too, but much more rarely.
> > >>>
> > >>> During these stalls, ie, when there's no network activity at all,
> > >>> the server NFSD threads are busy eating all available CPU.
> > >>>
> > >>> What does it all tell us? :)
> > >>>
> > >>> Thank you!
> > >>>
> > >>> /mjt
> > >>
> >

2012-08-18 12:58:33

by Michael Tokarev

[permalink] [raw]
Subject: Re: 3.0+ NFS issues (bisected)

On 18.08.2012 15:13, J. Bruce Fields wrote:
> On Sat, Aug 18, 2012 at 10:49:31AM +0400, Michael Tokarev wrote:
[]
>> Well. What can I say? With the change below applied (to 3.2 kernel
>> at least), I don't see any stalls or high CPU usage on the server
>> anymore. It survived several multi-gigabyte transfers, for several
>> hours, without any problem. So it is a good step forward ;)
>>
>> But the whole thing seems to be quite a bit fragile. I tried to follow
>> the logic in there, and the thing is quite a bit, well, "twisted", and
>> somewhat difficult to follow. So I don't know if this is the right
>> fix or not. At least it works! :)
>
> Suggestions welcomed.

Ok...

Meanwhile, you can add my
Tested-By: Michael Tokarev <[email protected]>

to the patch.

>> And I really wonder why no one else reported this problem before.
>> Am I the only one in this world who uses the Linux nfsd? :)
>
> This, for example:
>
> http://marc.info/?l=linux-nfs&m=134131915612287&w=2
>
> may well describe the same problem.... It just needed some debugging
> persistence, thanks!

Ah. I tried to find something when I initially
sent this report, but wasn't able to. Apparently
I'm not alone with this problem after all!

Thank you for all the work!

/mjt

2012-08-18 06:49:33

by Michael Tokarev

[permalink] [raw]
Subject: Re: 3.0+ NFS issues (bisected)

On 18.08.2012 02:32, J. Bruce Fields wrote:
> On Fri, Aug 17, 2012 at 04:08:07PM -0400, J. Bruce Fields wrote:
>> Wait a minute, that assumption's a problem because that calculation
>> depends in part on xpt_reserved, which is changed here....
>>
>> In particular, svc_xprt_release() calls svc_reserve(rqstp, 0), which
>> subtracts rqstp->rq_reserved and then calls svc_xprt_enqueue, now with a
>> lower xpt_reserved value. That could well explain this.
>
> So, maybe something like this?

Well. What can I say? With the change below applied (to 3.2 kernel
at least), I don't see any stalls or high CPU usage on the server
anymore. It survived several multi-gigabyte transfers, for several
hours, without any problem. So it is a good step forward ;)

But the whole thing seems to be quite a bit fragile. I tried to follow
the logic in there, and the thing is quite a bit, well, "twisted", and
somewhat difficult to follow. So I don't know if this is the right
fix or not. At least it works! :)

And I really wonder why no one else reported this problem before.
Am I the only one in this world who uses the Linux nfsd? :)

Thank you for all your patience and the proposed fix!

/mjt

> commit c8136c319ad85d0db870021fc3f9074d37f26d4a
> Author: J. Bruce Fields <[email protected]>
> Date: Fri Aug 17 17:31:53 2012 -0400
>
> svcrpc: don't add to xpt_reserved till we receive
>
> The rpc server tries to ensure that there will be room to send a reply
> before it receives a request.
>
> It does this by tracking, in xpt_reserved, an upper bound on the total
> size of the replies that it has already committed to for the socket.
>
> Currently it is adding in the estimate for a new reply *before* it
> checks whether there is space available. If it finds that there is not
> space, it then subtracts the estimate back out.
>
> This may lead the subsequent svc_xprt_enqueue to decide that there is
> space after all.
>
> The result is a svc_recv() that will repeatedly return -EAGAIN, causing
> server threads to loop without doing any actual work.
>
> Reported-by: Michael Tokarev <[email protected]>
> Signed-off-by: J. Bruce Fields <[email protected]>
>
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index ec99849a..59ff3a3 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -366,8 +366,6 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
> rqstp, rqstp->rq_xprt);
> rqstp->rq_xprt = xprt;
> svc_xprt_get(xprt);
> - rqstp->rq_reserved = serv->sv_max_mesg;
> - atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
> pool->sp_stats.threads_woken++;
> wake_up(&rqstp->rq_wait);
> } else {
> @@ -644,8 +642,6 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
> if (xprt) {
> rqstp->rq_xprt = xprt;
> svc_xprt_get(xprt);
> - rqstp->rq_reserved = serv->sv_max_mesg;
> - atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
>
> /* As there is a shortage of threads and this request
> * had to be queued, don't allow the thread to wait so
> @@ -743,6 +739,10 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
> len = xprt->xpt_ops->xpo_recvfrom(rqstp);
> dprintk("svc: got len=%d\n", len);
> }
> + if (len > 0) {
> + rqstp->rq_reserved = serv->sv_max_mesg;
> + atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
> + }
> svc_xprt_received(xprt);
>
> /* No data, incomplete (TCP) read, or accept() */


2012-08-17 17:18:56

by J. Bruce Fields

[permalink] [raw]
Subject: Re: 3.0+ NFS issues (bisected)

On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote:
> On 17.08.2012 20:00, J. Bruce Fields wrote:
> []> Uh, if I grepped my way through this right: it looks like it's the
> > "memory" column of the "TCP" row of /proc/net/protocols; might be
> > interesting to see how that's changing over time.
>
> This file does not look interesting. Memory usage does not jump,
> there's no large increase either.
>
> But there's something else which is interesting here.
>
> I noticed that in perf top, the top consumer of CPU is svc_recv()
> (I mentioned this in the start of this thread). So I looked how
> this routine is called from nfsd. And here we go.
>
> fs/nfsd/nfssvc.c:
>
> /*
> * This is the NFS server kernel thread
> */
> static int
> nfsd(void *vrqstp)
> {
> ...
> /*
> * The main request loop
> */
> for (;;) {
> /*
> * Find a socket with data available and call its
> * recvfrom routine.
> */
> int i = 0;
> while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
> ++i;
> printk(KERN_ERR "calling svc_recv: %d times (err=%d)\n", i, err);
> if (err == -EINTR)
> break;
> ...
>
> (I added the "i" counter and the printk). And here's the output:
>
> [19626.401136] calling svc_recv: 0 times (err=212)
> [19626.405059] calling svc_recv: 1478 times (err=212)
> [19626.409512] calling svc_recv: 1106 times (err=212)
> [19626.543020] calling svc_recv: 0 times (err=212)
> [19626.543059] calling svc_recv: 0 times (err=212)
> [19626.548074] calling svc_recv: 0 times (err=212)
> [19626.549515] calling svc_recv: 0 times (err=212)
> [19626.552320] calling svc_recv: 0 times (err=212)
> [19626.553503] calling svc_recv: 0 times (err=212)
> [19626.556007] calling svc_recv: 0 times (err=212)
> [19626.557152] calling svc_recv: 0 times (err=212)
> [19626.560109] calling svc_recv: 0 times (err=212)
> [19626.560943] calling svc_recv: 0 times (err=212)
> [19626.565315] calling svc_recv: 1067 times (err=212)
> [19626.569735] calling svc_recv: 2571 times (err=212)
> [19626.574150] calling svc_recv: 3842 times (err=212)
> [19626.581914] calling svc_recv: 2891 times (err=212)
> [19626.583072] calling svc_recv: 1247 times (err=212)
> [19626.616885] calling svc_recv: 0 times (err=212)
> [19626.616952] calling svc_recv: 0 times (err=212)
> [19626.622889] calling svc_recv: 0 times (err=212)
> [19626.624518] calling svc_recv: 0 times (err=212)
> [19626.627118] calling svc_recv: 0 times (err=212)
> [19626.629735] calling svc_recv: 0 times (err=212)
> [19626.631777] calling svc_recv: 0 times (err=212)
> [19626.633986] calling svc_recv: 0 times (err=212)
> [19626.636746] calling svc_recv: 0 times (err=212)
> [19626.637692] calling svc_recv: 0 times (err=212)
> [19626.640769] calling svc_recv: 0 times (err=212)
> [19626.657852] calling svc_recv: 0 times (err=212)
> [19626.661602] calling svc_recv: 0 times (err=212)
> [19626.670160] calling svc_recv: 0 times (err=212)
> [19626.671917] calling svc_recv: 0 times (err=212)
> [19626.684643] calling svc_recv: 0 times (err=212)
> [19626.684680] calling svc_recv: 0 times (err=212)
> [19626.812820] calling svc_recv: 0 times (err=212)
> [19626.814697] calling svc_recv: 0 times (err=212)
> [19626.817195] calling svc_recv: 0 times (err=212)
> [19626.820324] calling svc_recv: 0 times (err=212)
> [19626.822855] calling svc_recv: 0 times (err=212)
> [19626.824823] calling svc_recv: 0 times (err=212)
> [19626.828016] calling svc_recv: 0 times (err=212)
> [19626.829021] calling svc_recv: 0 times (err=212)
> [19626.831970] calling svc_recv: 0 times (err=212)
>
> > the stall begins:
> [19686.823135] calling svc_recv: 3670352 times (err=212)
> [19686.823524] calling svc_recv: 3659205 times (err=212)
>
> > transfer continues
> [19686.854734] calling svc_recv: 0 times (err=212)
> [19686.860023] calling svc_recv: 0 times (err=212)
> [19686.887124] calling svc_recv: 0 times (err=212)
> [19686.895532] calling svc_recv: 0 times (err=212)
> [19686.903667] calling svc_recv: 0 times (err=212)
> [19686.922780] calling svc_recv: 0 times (err=212)
>
> So we're calling svc_recv in a tight loop, eating
> all available CPU. (The above is with just 2 nfsd
> threads).
>
> Something is definitely wrong here. And it happens much more
> often after the mentioned commit (f03d78db65085).

Oh, neat. Hm. That commit doesn't really sound like the cause, then.
> Is that busy-looping reproducible on kernels before that commit?

--b.

2012-08-18 11:13:58

by J. Bruce Fields

[permalink] [raw]
Subject: Re: 3.0+ NFS issues (bisected)

On Sat, Aug 18, 2012 at 10:49:31AM +0400, Michael Tokarev wrote:
> On 18.08.2012 02:32, J. Bruce Fields wrote:
> > On Fri, Aug 17, 2012 at 04:08:07PM -0400, J. Bruce Fields wrote:
> >> Wait a minute, that assumption's a problem because that calculation
> >> depends in part on xpt_reserved, which is changed here....
> >>
> >> In particular, svc_xprt_release() calls svc_reserve(rqstp, 0), which
> >> subtracts rqstp->rq_reserved and then calls svc_xprt_enqueue, now with a
> >> lower xpt_reserved value. That could well explain this.
> >
> > So, maybe something like this?
>
> Well. What can I say? With the change below applied (to 3.2 kernel
> at least), I don't see any stalls or high CPU usage on the server
> anymore. It survived several multi-gigabyte transfers, for several
> hours, without any problem. So it is a good step forward ;)
>
> But the whole thing seems to be quite a bit fragile. I tried to follow
> the logic in there, and the thing is quite a bit, well, "twisted", and
> somewhat difficult to follow. So I don't know if this is the right
> fix or not. At least it works! :)

Suggestions welcomed.

> And I really wonder why no one else reported this problem before.
> Am I the only one in this world who uses the Linux nfsd? :)

This, for example:

http://marc.info/?l=linux-nfs&m=134131915612287&w=2

may well describe the same problem.... It just needed some debugging
persistence, thanks!

--b.

2012-08-17 19:18:04

by J. Bruce Fields

[permalink] [raw]
Subject: Re: 3.0+ NFS issues (bisected)

On Fri, Aug 17, 2012 at 09:29:40PM +0400, Michael Tokarev wrote:
> On 17.08.2012 21:26, Michael Tokarev wrote:
> > On 17.08.2012 21:18, J. Bruce Fields wrote:
> >> On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote:
> > []
> >>> So we're calling svc_recv in a tight loop, eating
> >>> all available CPU. (The above is with just 2 nfsd
> >>> threads).
> >>>
> >>> Something is definitely wrong here. And it happens much more
> >>> often after the mentioned commit (f03d78db65085).
> >>
> >> Oh, neat. Hm. That commit doesn't really sound like the cause, then.
> >> Is that busy-looping reproducible on kernels before that commit?
> >
> > Note I bisected this issue to this commit. I haven't seen it
> > happening before this commit, and reverting it from 3.0 or 3.2
> > kernel makes the problem go away.
> >
> > I guess it is looping there:
> >
> >
> > net/sunrpc/svc_xprt.c:svc_recv()
> > ...
> > len = 0;
> > ...
> > if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
> > ...
> > } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) { <=== here -- has no wspace due to memory...
> > ... len = <something>
> > }
> >
> > /* No data, incomplete (TCP) read, or accept() */
> > if (len == 0 || len == -EAGAIN)
> > goto out;
> > ...
> > out:
> > rqstp->rq_res.len = 0;
> > svc_xprt_release(rqstp);
> > return -EAGAIN;
> > }
> >
> > I'm trying to verify this theory...
>
> Yes. I inserted a printk there, and during all those million passes
> through this EAGAIN loop, the printk is triggering:
>
> ....
> [21052.533053] svc_recv: !has_wspace
> [21052.533070] svc_recv: !has_wspace
> [21052.533087] svc_recv: !has_wspace
> [21052.533105] svc_recv: !has_wspace
> [21052.533122] svc_recv: !has_wspace
> [21052.533139] svc_recv: !has_wspace
> [21052.533156] svc_recv: !has_wspace
> [21052.533174] svc_recv: !has_wspace
> [21052.533191] svc_recv: !has_wspace
> [21052.533208] svc_recv: !has_wspace
> [21052.533226] svc_recv: !has_wspace
> [21052.533244] svc_recv: !has_wspace
> [21052.533265] calling svc_recv: 1228163 times (err=-4)
> [21052.533403] calling svc_recv: 1226616 times (err=-4)
> [21052.534520] nfsd: last server has exited, flushing export cache
>
> (I stopped nfsd since it was flooding the log).
>
> I can only guess that before that commit, we always had space,
> now we don't anymore, and are looping like crazy.

Thanks! But, arrgh--that should be enough to go on at this point, but
I'm not seeing it. If has_wspace is returning false then it's likely
also returning false to the call at the start of svc_xprt_enqueue() (see
svc_xprt_has_something_to_do), which means the xprt shouldn't be getting
requeued and the next svc_recv call should find no socket ready (so
svc_xprt_dequeue() returns NULL), and go to sleep.

But clearly it's not working that way....

--b.

2012-08-17 14:56:27

by J. Bruce Fields

[permalink] [raw]
Subject: Re: 3.0+ NFS issues (bisected)

On Fri, Aug 17, 2012 at 05:56:56AM +0400, Michael Tokarev wrote:
> On 12.07.2012 16:53, J. Bruce Fields wrote:
> > On Tue, Jul 10, 2012 at 04:52:03PM +0400, Michael Tokarev wrote:
> >> I tried to debug this again, maybe to reproduce in a virtual machine,
> >> and found out that it is only the 32-bit server code that shows this issue:
> >> after updating the kernel on the server to 64bit (the same version)
> >> I can't reproduce this issue anymore. Rebooting back to 32bit,
> >> and voila, it is here again.
> >>
> >> Something apparently isn't right on 32 bits... ;)
> >>
> >> (And yes, the prob is still present and is very annoying :)
> >
> > OK, that's very useful, thanks. So probably a bug got introduced in the
> > 32-bit case between 2.6.32 and 3.0.
> >
> > My personal upstream testing is normally all x86_64 only. I'll kick off
> > a 32-bit install and see if I can reproduce this quickly.
>
> Actually it has nothing to do with 32 vs 64 bits as I
> initially thought. It happens on 64bits too, but takes
> more time (or data to transfer) to trigger.

That makes it sound like some kind of leak: you're hitting this case
eventually either way, but it takes longer in the case where you have
more (low) memory.

I wish I was more familiar with the tcp code.... What number exactly is
being compared against those limits, and how could we watch it from
userspace?

--b.

>
>
> > Let me know if you're able to narrow this down any more.
>
> I bisected this issue to the following commit:
>
> commit f03d78db65085609938fdb686238867e65003181
> Author: Eric Dumazet <[email protected]>
> Date: Thu Jul 7 00:27:05 2011 -0700
>
> net: refine {udp|tcp|sctp}_mem limits
>
> Current tcp/udp/sctp global memory limits are not taking into account
> hugepages allocations, and allow 50% of ram to be used by buffers of a
> single protocol [ not counting space used by sockets / inodes ...]
>
> Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram
> per protocol, and a minimum of 128 pages.
> Heavy duty machines sysadmins probably need to tweak limits anyway.
>
>
> Reverting this commit on top of 3.0 (or any later 3.x kernel) fixes
> the behaviour here.
>
> This machine has 4Gb of memory. On 3.0, with this patch applied
> (as it is part of 3.0), tcp_mem is like this:
>
> 21228 28306 42456
>
> with this patch reverted, tcp_mem shows:
>
> 81216 108288 162432
>
> and with these values, it works fine.
>
> So it looks like something else goes wrong there,
> which leads to all nfsds fighting with each other
> for something and eating 100% of available CPU
> instead of servicing clients.
>
> For added fun, when setting tcp_mem to the "good" value
> from "bad" value (after booting into kernel with that
> patch applied), the problem is _not_ fixed.
>
> Any further hints?
>
> Thanks,
>
> /mjt
>
> >> On 31.05.2012 17:51, Michael Tokarev wrote:
> >>> On 31.05.2012 17:46, Myklebust, Trond wrote:
> >>>> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
> >>> []
> >>>>> I started tcpdump:
> >>>>>
> >>>>> tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
> >>>>>
> >>>>> on the client (192.168.88.2). Next I mounted a directory on the client,
> >>>>> and started reading (tar'ing) a directory into /dev/null. It captured a
> >>>>> few stalls. Tcpdump shows the number of packets it got; the stalls are at
> >>>>> packet counts 58090, 97069 and 97071. I cancelled the capture after that.
> >>>>>
> >>>>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
> >>>>> it is 220Mb uncompressed and 1.3Mb compressed. The source files are
> >>>>> 10 files of 1Gb each, all made using the `truncate' utility, so they take
> >>>>> up no space on disk at all. This also makes it obvious that the issue
> >>>>> does not depend on the speed of disk on the server (since in this case,
> >>>>> the server disk isn't even in use).
> >>>>
> >>>> OK. So from the above file it looks as if the traffic is mainly READ
> >>>> requests.
> >>>
> >>> The issue here happens only with reads.
> >>>
> >>>> In 2 places the server stops responding. In both cases, the client seems
> >>>> to be sending a single TCP frame containing several COMPOUNDS containing
> >>>> READ requests (which should be legal) just prior to the hang. When the
> >>>> server doesn't respond, the client pings it with a RENEW, before it ends
> >>>> up severing the TCP connection and then retransmitting.
> >>>
> >>> And sometimes -- speaking only from the behaviour I've seen, not from the
> >>> actual frames sent -- the server does not respond to the RENEW either, in
> >>> which case the client reports "nfs server not responding", and on the next
> >>> renew it may actually respond. This happens too, but much more rarely.
> >>>
> >>> During these stalls, ie, when there's no network activity at all,
> >>> the server NFSD threads are busy eating all available CPU.
> >>>
> >>> What does it all tell us? :)
> >>>
> >>> Thank you!
> >>>
> >>> /mjt
> >>
>

2012-08-17 20:08:10

by J. Bruce Fields

[permalink] [raw]
Subject: Re: 3.0+ NFS issues (bisected)

On Fri, Aug 17, 2012 at 03:18:00PM -0400, J. Bruce Fields wrote:
> On Fri, Aug 17, 2012 at 09:29:40PM +0400, Michael Tokarev wrote:
> > On 17.08.2012 21:26, Michael Tokarev wrote:
> > > On 17.08.2012 21:18, J. Bruce Fields wrote:
> > >> On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote:
> > > []
> > >>> So we're calling svc_recv in a tight loop, eating
> > >>> all available CPU. (The above is with just 2 nfsd
> > >>> threads).
> > >>>
> > >>> Something is definitely wrong here. And it happens much more
> > >>> often after the mentioned commit (f03d78db65085).
> > >>
> > >> Oh, neat. Hm. That commit doesn't really sound like the cause, then.
> > >> Is that busy-looping reproducible on kernels before that commit?
> > >
> > > Note I bisected this issue to this commit. I haven't seen it
> > > happening before this commit, and reverting it from 3.0 or 3.2
> > > kernel makes the problem go away.
> > >
> > > I guess it is looping there:
> > >
> > >
> > > net/sunrpc/svc_xprt.c:svc_recv()
> > > ...
> > > len = 0;
> > > ...
> > > if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
> > > ...
> > > } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) { <=== here -- has no wspace due to memory...
> > > ... len = <something>
> > > }
> > >
> > > /* No data, incomplete (TCP) read, or accept() */
> > > if (len == 0 || len == -EAGAIN)
> > > goto out;
> > > ...
> > > out:
> > > rqstp->rq_res.len = 0;
> > > svc_xprt_release(rqstp);
> > > return -EAGAIN;
> > > }
> > >
> > > I'm trying to verify this theory...
> >
> > Yes. I inserted a printk there, and during all those million passes
> > through this EAGAIN loop, the printk is triggering:
> >
> > ....
> > [21052.533053] svc_recv: !has_wspace
> > [21052.533070] svc_recv: !has_wspace
> > [21052.533087] svc_recv: !has_wspace
> > [21052.533105] svc_recv: !has_wspace
> > [21052.533122] svc_recv: !has_wspace
> > [21052.533139] svc_recv: !has_wspace
> > [21052.533156] svc_recv: !has_wspace
> > [21052.533174] svc_recv: !has_wspace
> > [21052.533191] svc_recv: !has_wspace
> > [21052.533208] svc_recv: !has_wspace
> > [21052.533226] svc_recv: !has_wspace
> > [21052.533244] svc_recv: !has_wspace
> > [21052.533265] calling svc_recv: 1228163 times (err=-4)
> > [21052.533403] calling svc_recv: 1226616 times (err=-4)
> > [21052.534520] nfsd: last server has exited, flushing export cache
> >
> > (I stopped nfsd since it was flooding the log).
> >
> > I can only guess that before that commit, we always had space,
> > now we don't anymore, and are looping like crazy.
>
> Thanks! But, arrgh--that should be enough to go on at this point, but
> I'm not seeing it. If has_wspace is returning false then it's likely
> also returning false to the call at the start of svc_xprt_enqueue()

Wait a minute, that assumption's a problem because that calculation
depends in part on xpt_reserved, which is changed here....

In particular, svc_xprt_release() calls svc_reserve(rqstp, 0), which
subtracts rqstp->rq_reserved and then calls svc_xprt_enqueue, now with a
lower xpt_reserved value. That could well explain this.
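
To make that cycle concrete, here is a toy userspace model of the accounting
just described. It is emphatically not the sunrpc code; the reply estimate,
the send space and the loop bound are invented, and the names only mirror the
ones discussed in this thread.

#include <errno.h>
#include <stdio.h>

/* Toy model of the accounting described above; not the sunrpc code.
 * "wspace" stands for the send space the socket offers, which stays small
 * under TCP memory pressure; sv_max_mesg is the worst-case reply size that
 * gets charged to xpt_reserved.  All numbers are invented for illustration.
 */
static long xpt_reserved;                 /* replies already promised      */
static const long sv_max_mesg = 1048576;  /* ~1MB reply estimate (example) */
static const long wspace = 1500000;       /* clamped send space (example)  */

/* roughly the wspace test: room for everything reserved plus one reply */
static int has_wspace(void)
{
	return wspace >= xpt_reserved + sv_max_mesg;
}

/* svc_xprt_enqueue() only requeues the transport if a reply would fit */
static int requeued(void)
{
	return has_wspace();
}

/* pre-fix svc_recv(): charge the reservation first, test wspace second */
static int svc_recv_model(void)
{
	xpt_reserved += sv_max_mesg;          /* rq_reserved charged on wakeup */
	if (!has_wspace()) {
		/* svc_xprt_release() -> svc_reserve(rqstp, 0): undo it ...  */
		xpt_reserved -= sv_max_mesg;
		/* ... which calls svc_xprt_enqueue() with the lower value   */
		return requeued() ? -EAGAIN : 0;
	}
	return 1;                             /* would actually read a request */
}

int main(void)
{
	long spins = 0;

	while (svc_recv_model() == -EAGAIN && spins < 3670352)
		spins++;              /* cf. "calling svc_recv: 3670352 times" */
	printf("spun %ld times without receiving anything\n", spins);
	return 0;
}

Each pass charges sv_max_mesg, finds there is no longer room for it plus
another full reply, drops the charge again, and by doing so re-arms the
enqueue check, so the thread keeps getting -EAGAIN exactly as in the
svc_recv counters above.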

--b.

> (see
> svc_xprt_has_something_to_do), which means the xprt shouldn't be getting
> requeued and the next svc_recv call should find no socket ready (so
> svc_xprt_dequeue() returns NULL), and go to sleep.
>
> But clearly it's not working that way....
>
> --b.

2012-08-17 01:56:59

by Michael Tokarev

[permalink] [raw]
Subject: Re: 3.0+ NFS issues (bisected)

On 12.07.2012 16:53, J. Bruce Fields wrote:
> On Tue, Jul 10, 2012 at 04:52:03PM +0400, Michael Tokarev wrote:
>> I tried to debug this again, maybe to reproduce in a virtual machine,
>> and found out that it is only the 32-bit server code that shows this issue:
>> after updating the kernel on the server to 64bit (the same version)
>> I can't reproduce this issue anymore. Rebooting back to 32bit,
>> and voila, it is here again.
>>
>> Something apparently isn't right on 32 bits... ;)
>>
>> (And yes, the prob is still present and is very annoying :)
>
> OK, that's very useful, thanks. So probably a bug got introduced in the
> 32-bit case between 2.6.32 and 3.0.
>
> My personal upstream testing is normally all x86_64 only. I'll kick off
> a 32-bit install and see if I can reproduce this quickly.

Actually it has nothing to do with 32 vs 64 bits as I
initially thought. It happens on 64bits too, but takes
more time (or data to transfer) to trigger.


> Let me know if you're able to narrow this down any more.

I bisected this issue to the following commit:

commit f03d78db65085609938fdb686238867e65003181
Author: Eric Dumazet <[email protected]>
Date: Thu Jul 7 00:27:05 2011 -0700

net: refine {udp|tcp|sctp}_mem limits

Current tcp/udp/sctp global memory limits are not taking into account
hugepages allocations, and allow 50% of ram to be used by buffers of a
single protocol [ not counting space used by sockets / inodes ...]

Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram
per protocol, and a minimum of 128 pages.
Heavy duty machines sysadmins probably need to tweak limits anyway.


Reverting this commit on top of 3.0 (or any later 3.x kernel) fixes
the behaviour here.

This machine has 4Gb of memory. On 3.0, with this patch applied
(as it is part of 3.0), tcp_mem is like this:

21228 28306 42456

with this patch reverted, tcp_mem shows:

81216 108288 162432

and with these values, it works fine.
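
As a cross-check (assuming the formula the patch introduces is
limit = nr_free_buffer_pages()/8 with tcp_mem set to limit/4*3, limit and
limit/4*3*2; the real computation lives in tcp_init()), the observed values
are self-consistent:

#include <stdio.h>

/* Back-of-the-envelope check, under the assumption stated above.
 * Feeding in the observed tcp_mem[1] reproduces the other two numbers
 * and shows how much "free buffer" memory the kernel thinks it has. */
int main(void)
{
	unsigned long limit = 28306;         /* observed tcp_mem[1], in pages  */
	unsigned long mem0  = limit / 4 * 3; /* -> 21228                       */
	unsigned long mem2  = mem0 * 2;      /* -> 42456                       */
	unsigned long pages = limit * 8;     /* implied nr_free_buffer_pages() */

	printf("tcp_mem: %lu %lu %lu\n", mem0, limit, mem2);
	printf("implied free buffer pages: %lu (~%lu MB)\n",
	       pages, pages * 4096 / (1024 * 1024));
	return 0;
}

This prints the observed 21228 28306 42456 and implies roughly 884MB of free
buffer pages, about the 32-bit lowmem of a 4GB box; a 64-bit kernel would see
several times more and hence much larger limits, which would fit the
observation that 64 bits needs more data to trigger the problem.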

So it looks like something else goes wrong there,
which leads to all nfsds fighting with each other
for something and eating 100% of available CPU
instead of servicing clients.

For added fun, when setting tcp_mem to the "good" value
from "bad" value (after booting into kernel with that
patch applied), the problem is _not_ fixed.

Any further hints?

Thanks,

/mjt

>> On 31.05.2012 17:51, Michael Tokarev wrote:
>>> On 31.05.2012 17:46, Myklebust, Trond wrote:
>>>> On Thu, 2012-05-31 at 17:24 +0400, Michael Tokarev wrote:
>>> []
>>>>> I started tcpdump:
>>>>>
>>>>> tcpdump -npvi br0 -s 0 host 192.168.88.4 and \( proto ICMP or port 2049 \) -w nfsdump
>>>>>
>>>>> on the client (192.168.88.2). Next I mounted a directory on the client,
>>>>> and started reading (tar'ing) a directory into /dev/null. It captured a
>>>>> few stalls. Tcpdump shows the number of packets it got; the stalls are at
>>>>> packet counts 58090, 97069 and 97071. I cancelled the capture after that.
>>>>>
>>>>> The resulting file is available at http://www.corpit.ru/mjt/tmp/nfsdump.xz ,
>>>>> it is 220Mb uncompressed and 1.3Mb compressed. The source files are
>>>>> 10 files of 1Gb each, all made using the `truncate' utility, so they take
>>>>> up no space on disk at all. This also makes it obvious that the issue
>>>>> does not depend on the speed of disk on the server (since in this case,
>>>>> the server disk isn't even in use).
>>>>
>>>> OK. So from the above file it looks as if the traffic is mainly READ
>>>> requests.
>>>
>>> The issue here happens only with reads.
>>>
>>>> In 2 places the server stops responding. In both cases, the client seems
>>>> to be sending a single TCP frame containing several COMPOUNDS containing
>>>> READ requests (which should be legal) just prior to the hang. When the
>>>> server doesn't respond, the client pings it with a RENEW, before it ends
>>>> up severing the TCP connection and then retransmitting.
>>>
>>> And sometimes -- speaking only from the behaviour I've seen, not from the
>>> actual frames sent -- the server does not respond to the RENEW either, in
>>> which case the client reports "nfs server not responding", and on the next
>>> renew it may actually respond. This happens too, but much more rarely.
>>>
>>> During these stalls, ie, when there's no network activity at all,
>>> the server NFSD threads are busy eating all available CPU.
>>>
>>> What does it all tell us? :)
>>>
>>> Thank you!
>>>
>>> /mjt
>>


2012-08-17 17:29:42

by Michael Tokarev

[permalink] [raw]
Subject: Re: 3.0+ NFS issues (bisected)

On 17.08.2012 21:26, Michael Tokarev wrote:
> On 17.08.2012 21:18, J. Bruce Fields wrote:
>> On Fri, Aug 17, 2012 at 09:12:38PM +0400, Michael Tokarev wrote:
> []
>>> So we're calling svc_recv in a tight loop, eating
>>> all available CPU. (The above is with just 2 nfsd
>>> threads).
>>>
>>> Something is definitely wrong here. And it happens much more
>>> often after the mentioned commit (f03d78db65085).
>>
>> Oh, neat. Hm. That commit doesn't really sound like the cause, then.
>> Is that busy-looping reproducible on kernels before that commit?
>
> Note I bisected this issue to this commit. I haven't seen it
> happening before this commit, and reverting it from 3.0 or 3.2
> kernel makes the problem go away.
>
> I guess it is looping there:
>
>
> net/sunrpc/svc_xprt.c:svc_recv()
> ...
> len = 0;
> ...
> if (test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
> ...
> } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) { <=== here -- has no wspace due to memory...
> ... len = <something>
> }
>
> /* No data, incomplete (TCP) read, or accept() */
> if (len == 0 || len == -EAGAIN)
> goto out;
> ...
> out:
> rqstp->rq_res.len = 0;
> svc_xprt_release(rqstp);
> return -EAGAIN;
> }
>
> I'm trying to verify this theory...

Yes. I inserted a printk there, and during all those million passes
through this EAGAIN loop, the printk is triggering:

....
[21052.533053] svc_recv: !has_wspace
[21052.533070] svc_recv: !has_wspace
[21052.533087] svc_recv: !has_wspace
[21052.533105] svc_recv: !has_wspace
[21052.533122] svc_recv: !has_wspace
[21052.533139] svc_recv: !has_wspace
[21052.533156] svc_recv: !has_wspace
[21052.533174] svc_recv: !has_wspace
[21052.533191] svc_recv: !has_wspace
[21052.533208] svc_recv: !has_wspace
[21052.533226] svc_recv: !has_wspace
[21052.533244] svc_recv: !has_wspace
[21052.533265] calling svc_recv: 1228163 times (err=-4)
[21052.533403] calling svc_recv: 1226616 times (err=-4)
[21052.534520] nfsd: last server has exited, flushing export cache

(I stopped nfsd since it was flooding the log).

I can only guess that before that commit, we always had space,
now we don't anymore, and are looping like crazy.
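
Roughly, the test that keeps failing has the shape sketched below. This is a
paraphrase based on the discussion in this thread, not the verbatim
svc_tcp_has_wspace(), and the numbers are invented; the point is only that
the send space the socket offers stays clamped once TCP is under global
memory pressure, which the small post-f03d78db65085 tcp_mem defaults make
much easier to reach, so there is never room for one more maximum-sized reply.

#include <stdio.h>

/* Paraphrase of the check discussed here; not the verbatim kernel code.
 * The shape is: the socket must have room for all already-reserved
 * replies plus one more maximum-sized reply. */
static int has_wspace_model(long send_space, long reserved, long max_mesg)
{
	return send_space >= reserved + max_mesg;
}

int main(void)
{
	const long max_mesg = 1048576;  /* ~1MB worst-case reply (example) */

	/* roomy send buffer: fine even with one reply already reserved */
	printf("no pressure:    %d\n",
	       has_wspace_model(4 * max_mesg, max_mesg, max_mesg));

	/* send space clamped by TCP memory pressure: never enough room,
	 * so svc_recv() keeps falling through to the -EAGAIN path above */
	printf("under pressure: %d\n",
	       has_wspace_model(max_mesg / 2, 0, max_mesg));
	return 0;
}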

Thanks,

/mjt