2018-09-19 12:42:56

by Jäkel, Guido

Subject: Bump: NFS3 subsystem hung, Kernel alive

Dear NFS Maintainers,

I am really asking for your help!

In the meantime, I have changed to kernel 4.14.61, but the issue remains. Yesterday, one of our two blade servers used for production "froze" twice, with a gap of about 1h. I think the freeze was caused by an event in a customer use case, and because it probably failed, it was tried again.

When the freeze occurs, the NFS subsystem stops working, but everything else continues to run "fine" -- up to the point where a process needs to access a file (one that is not in the cache?). This is especially bad because all of the service checks based on simple "network communication" stay green. You can even establish an SSH session or start a login at the console: this works fine up to the point where userland needs to access a file (something like /etc/{passwd,group,shadow} or some rc files used by the shell).

The "last indirect sign of life" is a clear I/O peak recorded by a monitoring system (Zabbix): the last recorded measuring point is about 80 MBps of "in" traffic on the NIC used for communication (eth1) and a corresponding "out" peak on the NIC used for NFS -- in other words, something like an upload where a stream coming in via the "network" is stored to a file.

With greetings

Guido

>-----Original Message-----
>From: Jäkel, Guido
>Sent: Friday, June 22, 2018 12:27 PM
>To: 'J. Bruce Fields' <[email protected]>; 'Jeff Layton' <[email protected]>
>Cc: '[email protected]' <[email protected]>
>Subject: NFS3 subsystem hung, Kernel alive
>
>Dear NFS Maintainers,
>
>I'm using diskless blade servers with PXE boot and NFS3 for the RootFS and other filesystems. These blade servers are loaded with LXC to run
>containers with our applications.
>
>I'm observing a complete freeze of the NFS3 client. It has been happening for some while, about once a week. Using "binary splitting" of
>the workload, it took a lot of time to trace down a trigger, but since yesterday I have been able to isolate it: it is triggered by running a user's
>ordinary, unremarkable batch job (a "traditional" command-line bash pipeline sorting a multi-GB file with sort; /tmp is on NFS, too).
>Because the NFS client freezes, the system can't load any uncached userland binary for inspection. No logs can be written, for the
>same reason. But the system and kernel are fully alive; the host can be pinged, for instance.
>
>We're using a whole bunch of blade servers and rack servers, but there are just three different hardware models at the moment. The
>issue occurs on an older IBM x3550 rack server. It has two 1 GBit onboard NICs ("Ethernet controller: Broadcom Limited NetXtreme
>II BCM5709 Gigabit Ethernet (rev 20)"). One of them is used for the NFS filesystem I/O, the other for the application traffic.
>While performing the merge phase of the sort, the "filesystem NIC" is "overbooked at its limit", because the external NetApp NFS
>filer allows about 400 MByte of write bandwidth even as the worst-case lower limit.
>
>I was not able to reproduce the hang if I start the corresponding container and the job on our main Cisco UCS blades. This
>blade hardware has a 10 GBit link to the chassis and 40 GBit upstream links to the core switch. Because of that, the file I/O
>bandwidth there is limited only by the filer. And the NIC hardware (Cisco Systems Inc VIC Ethernet NIC) and Linux driver are different,
>too. But everything else, like the kernel image and the software used, is exactly the same, because it is all shared via NFS.
>
>I was just able to take a photo of the console output of the magic SysRq key ('w'). It shows NFS RPC tasks waiting on some bit,
>and one backtrace resulting from the hung_task timer.
>
>
>Current kernel:
>
>	root@xrunner0 ~ # uname -a
>	Linux xrunner0 4.14.43-gentoo #3 SMP Thu May 24 12:58:31 CEST 2018 x86_64 Intel(R) Xeon(R) CPU E5530 @ 2.40GHz GenuineIntel GNU/Linux
>
>
>RootFS mount for the hosting blade:
>
>	root@xrunner0 ~ # mount | grep "on / "
>	10.69.XXX.XXX:/02/q/diskless/roots/xrunner0 on / type nfs
>(rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,nolock,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.69.XXX.XXX,mountvers=3,
>mountproto=tcp,local_lock=all,addr=10.69.XXX.XXX)
>
>
>RootFS mount for the container:
>
>	root@evalaene0 ~ # mount | grep "on / "
>	netapp2:/09/q/diskless/roots/evalaene0 on / type nfs
>(rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none,addr=10.69.XXX.XXX)
>
>
>The test case:
>
>The input file is about 6 GB of short lines, and not much is filtered out, so sort will write about 6 GB of temporary files as /tmp/sort*, too. The
>freeze happens while merging down the sort* files.
>
>	jaekel@evalaene0 ~ $ cat MarcZwiGND-1.2.csv | grep "\^\^[0-9]" | sed 's/\s/_/g' | sed 's/~#/~ #/g' | sed 's/\~\s#$//g' |
>awk '{ for (i=2;i<=NF;i++) {print $1" "$i}}'| sed 's/\~$//g' | sed 's/\(.*\)\~\s\#\(.*\)\^\^\([0-9X\-]*\)$/"\1"; ;"\2"; ;"\3"/'
>| grep "\".*; ;\".*\"; ;\"[0-9]" | sort -k3 | sed 's/; ;/ /' > MarcZwiGND-1.out.csv
>
>Unfortunately, attaching strace seems to hide the issue.
>
>
>Please ask for any more info you need.
>
>
>Greetings
>
>Guido
>
>--
>***Lesen. Hören. Wissen. Deutsche Nationalbibliothek***
>
>Dr. Guido Jäkel
>Deutsche Nationalbibliothek
>Informationsinfrastruktur / Rechenzentrum / Infrastruktur Unix
>Adickesallee 1
>60322 Frankfurt am Main
>Tel: +49 69 1525 -1750
>mailto:[email protected]
>http://www.dnb.de


2018-09-20 00:52:46

by J. Bruce Fields

Subject: Re: Bump: NFS3 subsystem hung, Kernel alive

That just looks hard to debug, unfortunately. Have you tried asking
Netapp, or do you have a support contract for your Linux clients? Was
there an older kernel that worked OK?

--b.

On Wed, Sep 19, 2018 at 06:58:06AM +0000, Jäkel, Guido wrote:
> Dear NFS Maintainers,
>
> I am really asking for your help!
>
> In the meantime, I have changed to kernel 4.14.61, but the issue remains. Yesterday, one of our two blade servers used for production "froze" twice, with a gap of about 1h. I think the freeze was caused by an event in a customer use case, and because it probably failed, it was tried again.
>
> When the freeze occurs, the NFS subsystem stops working, but everything else continues to run "fine" -- up to the point where a process needs to access a file (one that is not in the cache?). This is especially bad because all of the service checks based on simple "network communication" stay green. You can even establish an SSH session or start a login at the console: this works fine up to the point where userland needs to access a file (something like /etc/{passwd,group,shadow} or some rc files used by the shell).
>
> The "last indirect sign of life" is a clear I/O peak recorded by a monitoring system (Zabbix): the last recorded measuring point is about 80 MBps of "in" traffic on the NIC used for communication (eth1) and a corresponding "out" peak on the NIC used for NFS -- in other words, something like an upload where a stream coming in via the "network" is stored to a file.
>
> With greetings
>
> Guido
>
> >-----Original Message-----
> >From: Jäkel, Guido
> >Sent: Friday, June 22, 2018 12:27 PM
> >To: 'J. Bruce Fields' <[email protected]>; 'Jeff Layton' <[email protected]>
> >Cc: '[email protected]' <[email protected]>
> >Subject: NFS3 subsystem hung, Kernel alive
> >
> >Dear NFS Maintainers,
> >
> >I'm using diskless blade servers with PXE boot and NFS3 for the RootFS and other filesystems. These blade servers are loaded with LXC to run
> >containers with our applications.
> >
> >I'm observing a complete freeze of the NFS3 client. It has been happening for some while, about once a week. Using "binary splitting" of
> >the workload, it took a lot of time to trace down a trigger, but since yesterday I have been able to isolate it: it is triggered by running a user's
> >ordinary, unremarkable batch job (a "traditional" command-line bash pipeline sorting a multi-GB file with sort; /tmp is on NFS, too).
> >Because the NFS client freezes, the system can't load any uncached userland binary for inspection. No logs can be written, for the
> >same reason. But the system and kernel are fully alive; the host can be pinged, for instance.
> >
> >We're using a whole bunch of blade servers and rack servers, but there are just three different hardware models at the moment. The
> >issue occurs on an older IBM x3550 rack server. It has two 1 GBit onboard NICs ("Ethernet controller: Broadcom Limited NetXtreme
> >II BCM5709 Gigabit Ethernet (rev 20)"). One of them is used for the NFS filesystem I/O, the other for the application traffic.
> >While performing the merge phase of the sort, the "filesystem NIC" is "overbooked at its limit", because the external NetApp NFS
> >filer allows about 400 MByte of write bandwidth even as the worst-case lower limit.
> >
> >I was not able to reproduce the hang if I start the corresponding container and the job on our main Cisco UCS blades. This
> >blade hardware has a 10 GBit link to the chassis and 40 GBit upstream links to the core switch. Because of that, the file I/O
> >bandwidth there is limited only by the filer. And the NIC hardware (Cisco Systems Inc VIC Ethernet NIC) and Linux driver are different,
> >too. But everything else, like the kernel image and the software used, is exactly the same, because it is all shared via NFS.
> >
> >I was just able to take a photo of the console output of the magic SysRq key ('w'). It shows NFS RPC tasks waiting on some bit,
> >and one backtrace resulting from the hung_task timer.
> >
> >
> >Current Kernel:
> >
> > root@xrunner0 ~ # uname -a
> > Linux xrunner0 4.14.43-gentoo #3 SMP Thu May 24 12:58:31 CEST 2018 x86_64 Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
> >GenuineIntel GNU/Linux
> >
> >
> >RootFS-mount for the hosting blade:
> >
> > root@xrunner0 ~ # mount | grep "on / "
> > 10.69.XXX.XXX:/02/q/diskless/roots/xrunner0 on / type nfs
> >(rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,nolock,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.69.XXX.XXX,mountvers=3,
> >mountproto=tcp,local_lock=all,addr=10.69.XXX.XXX)
> >
> >
> >
> >RootFS-mount for the container:
> >
> > root@evalaene0 ~ # mount | grep "on / "
> > netapp2:/09/q/diskless/roots/evalaene0 on / type nfs
> >(rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none,addr=10.69.XXX.XXX)
> >
> >
> >The testcase:
> >
> >The input file is about 6 GB of short lines, and not much is filtered out, so sort will write about 6 GB of temporary files as /tmp/sort*, too. The
> >freeze happens while merging down the sort* files.
> >
> > jaekel@evalaene0 ~ $ cat MarcZwiGND-1.2.csv | grep "\^\^[0-9]" | sed 's/\s/_/g' | sed 's/~#/~ #/g' | sed 's/\~\s#$//g' |
> >awk '{ for (i=2;i<=NF;i++) {print $1" "$i}}'| sed 's/\~$//g' | sed 's/\(.*\)\~\s\#\(.*\)\^\^\([0-9X\-]*\)$/"\1"; ;"\2"; ;"\3"/'
> >| grep "\".*; ;\".*\"; ;\"[0-9]" | sort -k3 | sed 's/; ;/ /' > MarcZwiGND-1.out.csv
> >
> >Unfortunately, attaching strace seems to hide the issue.
> >
> >
> >Please ask for any more info you need.
> >
> >
> >Greetings
> >
> >Guido
> >
> >--
> >***Lesen. Hören. Wissen. Deutsche Nationalbibliothek***
> >
> >Dr. Guido Jäkel
> >Deutsche Nationalbibliothek
> >Informationsinfrastruktur / Rechenzentrum / Infrastruktur Unix
> >Adickesallee 1
> >60322 Frankfurt am Main
> >Tel: +49 69 1525 -1750
> >mailto:[email protected]
> >http://www.dnb.de
>
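The trigger described in the quoted report is essentially a large GNU sort spilling and merging temporary files on an NFS-backed /tmp. A much smaller, self-contained stand-in (hypothetical paths; a sketch, not the author's exact pipeline) forces the same spill-and-merge behaviour by shrinking sort's memory buffer:

```shell
# Generate ~7 MB of unsorted numbers, then sort with a 1 MB buffer so
# that sort must write spill files under TMPDIR and merge them -- the
# phase during which the reported freeze occurs. On the affected host,
# TMPDIR would point at the NFS-backed /tmp.
mkdir -p /tmp/sorttest
seq 1000000 | shuf > /tmp/sorttest/input.txt
TMPDIR=/tmp/sorttest sort -n -S 1M /tmp/sorttest/input.txt > /tmp/sorttest/output.txt
wc -l < /tmp/sorttest/output.txt
```

On a healthy client this finishes in seconds; scaled up to the ~6 GB input from the report, the merge phase is what saturates the 1 GBit "filesystem NIC", which is the condition under which the hang was observed.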

2018-09-20 11:30:43

by Jäkel, Guido

Subject: Re: Bump: NFS3 subsystem hung, Kernel alive

Dear Bruce,

thank you for the quick response. And yes -- it's not an easy one, for sure; that's why I really need your kernel gurus' expertise!

We have no support contract for Linux; it's all based on Gentoo Linux, so we are totally free to try out anything you ask for. This "datacenter design" has now been working for about 8 years across a whole bunch of kernel versions without any problem. The issue *may* have started to appear in 2018Q1, maybe with the change to LTS 4.14 or with the changes concerning the Spectre theme. It has happened two or three times this year on different other container hosts serving the Test and Approval stages. Because of the shared NFS infrastructure, they all use exactly the same kernel image and (template-sourced copies of) the same root image.

There are also some older rack servers (for the Evaluation stage) with comparably smaller hardware. They just have 1 GBit NICs, and there might be a clue that the issue is forced there by heavy file I/O workload. I have to re-check this.

Unfortunately, the Apache email infrastructure is problematic (it doesn't accept some mail encodings), but in the end I was able to create an account and open an issue (https://bugzilla.linux-nfs.org/show_bug.cgi?id=328). But I still can't attach anything to it; I just get some "internal error". Therefore, please ask and I'll send things via email. I have a photo of the "SysRq" console showing a locked task and a tcpdump of "port nfs" taken at the last event. I can send the kernel config file, kernel image, or whatever you need.

thank you all in advance

Guido

Kernel history:

root@bladerunner14 ~ # ll /boot{,/_save}/kernel* -t | grep -v old
lrwxrwxrwx 1 root root 38 Sep 18 11:46 /boot/kernel -> kernel-genkernel-x86_64-4.14.65-gentoo
-rw-r--r-- 1 root root 5.3M Sep 18 11:46 /boot/kernel-genkernel-x86_64-4.14.65-gentoo
-rw-r--r-- 1 root root 5.3M Aug 8 12:31 /boot/kernel-genkernel-x86_64-4.14.61-gentoo
-rw-r--r-- 1 root root 5.3M May 24 12:58 /boot/kernel-genkernel-x86_64-4.14.43-gentoo
-rw-r--r-- 1 root root 5.3M Apr 9 12:20 /boot/kernel-genkernel-x86_64-4.14.32-gentoo
-rw-r--r-- 1 root root 4.5M Feb 27 2018 /boot/_save/kernel-genkernel-x86_64-4.9.84-gentoo
-rw-r--r-- 1 root root 4.5M Jan 23 2018 /boot/_save/kernel-genkernel-x86_64-4.9.76-gentoo-r1
-rw-r--r-- 1 root root 4.5M Nov 16 2017 /boot/_save/kernel-genkernel-x86_64-4.9.61-gentoo
-rw-r--r-- 1 root root 4.2M Jan 3 2017 /boot/_save/kernel-genkernel-x86_64-4.4.39-gentoo
-rw-r--r-- 1 root root 4.0M Oct 6 2015 /boot/_save/kernel-genkernel-x86_64-3.14.51-gentoo
-rw-r--r-- 1 root root 4.0M Aug 28 2015 /boot/_save/kernel-genkernel-x86_64-3.14.9-gentoo
-rw-r--r-- 1 root root 3.7M Jul 14 2014 /boot/_save/kernel-genkernel-x86_64-3.10.20-gentoo
-rw-r--r-- 1 root root 3.5M Oct 10 2013 /boot/_save/kernel-genkernel-x86_64-3.8.13-gentoo
-rw-r--r-- 1 root root 3.4M Apr 23 2013 /boot/_save/kernel-genkernel-x86_64-3.3.5-gentoo
-rw-r--r-- 1 root root 3.3M May 16 2012 /boot/_save/kernel-genkernel-x86_64-3.3.4-gentoo
-rw-r--r-- 1 root root 2.7M Dec 1 2010 /boot/_save/kernel-genkernel-x86_64-2.6.34-gentoo-r6

On 19.09.2018 21:13, 'J. Bruce Fields' wrote:
> That just looks hard to debug, unfortunately. Have you tried asking
> Netapp, or do you have a support contract for your Linux clients? Was
> there an older kernel that worked OK?
>