2012-03-07 19:46:55

by Paul Anderson

Subject: new (to us) kernel panic nfsv4 linux 3.0.12

The following kernel panic occurred on at least 4 compute nodes nearly
simultaneously. It was during unattended operation, so no clue as to
what the server was doing.

The client node was under very heavy CPU load (12 core plus HT with
50-100 jobs running). No swapping, unknown I/O but probably low,
except for the set of slurm jobs that stopped in D state probably due
to the kernel panic.

uname -> Linux c09 3.0.12 #1 SMP Wed Nov 30 19:42:40 EST 2011 x86_64 GNU/Linux

Please let me know what additional information I can provide - thanks!

Paul Anderson
University of Michigan

[1411404.724301] nfs4_reclaim_open_state: Lock reclaim failed!
[1412738.175791] nfs4_reclaim_open_state: Lock reclaim failed!
[1412738.175805] general protection fault: 0000 [#1] SMP
[1412738.176036] CPU 3
[1412738.176112] Modules linked in: binfmt_misc ipmi_msghandler
ipt_ULOG x_tables autofs4 mptctl mptbase dlm configfs dm_crypt nfsd
nfs lockd xfs auth_rpcgss n
[1412738.177205]
[1412738.177297] Pid: 10473, comm: 192.168.1.16-ma Not tainted 3.0.12
#1 Dell C6100 /0D61XP
[1412738.177683] RIP: 0010:[<ffffffffa02a8e00>] [<ffffffffa02a8e00>]
nfs4_do_reclaim+0x1c0/0x560 [nfs]
[1412738.178074] RSP: 0018:ffff88100e651e00 EFLAGS: 00010287
[1412738.178296] RAX: 0000000000000042 RBX: ffff88080dff5380 RCX:
000000000003ffff
[1412738.178606] RDX: ffff88080dff53a0 RSI: 0000000000000082 RDI:
0000000000000246
[1412738.178917] RBP: ffff88100e651e80 R08: 0000000000000000 R09:
0000000000000000
[1412738.179227] R10: 0000000000000006 R11: 0000000000000000 R12:
ffffffffa02b9c00
[1412738.179537] R13: dead000000100100 R14: ffff88100e762a58 R15:
ffff88100e762a00
[1412738.179848] FS: 0000000000000000(0000) GS:ffff88083fc60000(0000)
knlGS:0000000000000000
[1412738.180192] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[1412738.180428] CR2: 0000000001c89068 CR3: 000000100534f000 CR4:
00000000000006e0
[1412738.180739] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[1412738.181049] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[1412738.181360] Process 192.168.1.16-ma (pid: 10473, threadinfo
ffff88100e650000, task ffff8809a7ca8000)
[1412738.181739] Stack:
[1412738.181847] ffff88080dff53a0 ffff88080dff53c0 ffff8808055cf4b0
ffff8808055cf400
[1412738.182192] ffff88100e762a50 ffff88054ab0b2b0 ffff8808055cf4f8
ffff88100e762a48
[1412738.182538] ffffffffa02b9ec8 ffff880ac2296008 ffff88100e651e80
ffff8808055cf4f0
[1412738.182882] Call Trace:
[1412738.183015] [<ffffffffa02a9424>] nfs4_run_state_manager+0x284/0x420 [nfs]
[1412738.183298] [<ffffffffa02a91a0>] ? nfs4_do_reclaim+0x560/0x560 [nfs]
[1412738.183562] [<ffffffff81080a96>] kthread+0x96/0xa0
[1412738.183771] [<ffffffff815ac124>] kernel_thread_helper+0x4/0x10
[1412738.184927] [<ffffffff81080a00>] ? kthread_worker_fn+0x190/0x190
[1412738.185177] [<ffffffff815ac120>] ? gs_change+0x13/0x13
[1412738.185395] Code: 48 74 50 4d 8b 6d 00 4d 85 ed 75 df e8 2a a5 ee
e0 48 8b 7d a8 e8 41 cf dd e0 4c 8b 6b 20 48 8d 53 20 49 39 d5 74 18
0f 1f 40 00
[1412738.186187] f6 45 18 01 0f 84 6a 03 00 00 4d 8b 6d 00 49 39 d5 75 ec 48
[1412738.186646] RIP [<ffffffffa02a8e00>] nfs4_do_reclaim+0x1c0/0x560 [nfs]
[1412738.186926] RSP <ffff88100e651e00>
[1412738.187353] ---[ end trace 4dbb732d1756f6b1 ]---
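
A note for readers of the trace above: R13 holds dead000000100100, which on x86_64 kernels of this era equals LIST_POISON1 (0x00100100 plus the 0xdead000000000000 poison delta), the value list_del() writes into the next pointer of a removed list entry. That hints, without proving it, that nfs4_do_reclaim followed a list entry that had already been unlinked. The Python sketch below scans an oops for such values. The function name, the regular expression, and the two-entry poison table are assumptions made for illustration, not part of any kernel tool.

```python
import re

# Assumed x86_64 values for kernels of this era: LIST_POISON1/2 from
# include/linux/poison.h with POISON_POINTER_DELTA = 0xdead000000000000.
LIST_POISONS = {
    0xdead000000100100: "LIST_POISON1 (next pointer of a deleted list entry)",
    0xdead000000200200: "LIST_POISON2 (prev pointer of a deleted list entry)",
}

# Matches general-purpose register pairs such as "R13: dead000000100100".
# \s+ also accepts values that the mail client wrapped onto the next line.
REG_RE = re.compile(r"\b(R[A-Z0-9]{2}):\s+([0-9a-f]{16})\b")


def find_list_poison(oops_text):
    """Return (register, description) for registers holding a list poison value."""
    hits = []
    for name, value in REG_RE.findall(oops_text):
        label = LIST_POISONS.get(int(value, 16))
        if label:
            hits.append((name, label))
    return hits


# Two register lines copied from the oops above, including the mail wrapping.
sample = """\
[1412738.179227] R10: 0000000000000006 R11: 0000000000000000 R12:
ffffffffa02b9c00
[1412738.179537] R13: dead000000100100 R14: ffff88100e762a58 R15:
ffff88100e762a00
"""

for register, description in find_list_poison(sample):
    print(register, "holds", description)
```

A hit like this only narrows the search to list handling in the faulting function; confirming the cause still requires the source and disassembly for the exact kernel build.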


2012-03-07 20:49:23

by Myklebust, Trond

Subject: Re: new (to us) kernel panic nfsv4 linux 3.0.12

On Wed, 2012-03-07 at 14:41 -0500, Paul Anderson wrote:
> The following kernel panic occurred on at least 4 compute nodes nearly
> simultaneously. It was during unattended operation, so no clue as to
> what the server was doing.
>
> The client node was under very heavy CPU load (12 core plus HT with
> 50-100 jobs running). No swapping, unknown I/O but probably low,
> except for the set of slurm jobs that stopped in D state probably due
> to the kernel panic.
>
> uname -> Linux c09 3.0.12 #1 SMP Wed Nov 30 19:42:40 EST 2011 x86_64 GNU/Linux
>
> Please let me know what additional information I can provide - thanks!
>
> Paul Anderson
> University of Michigan
>
> [1411404.724301] nfs4_reclaim_open_state: Lock reclaim failed!
> [1412738.175791] nfs4_reclaim_open_state: Lock reclaim failed!
> [1412738.175805] general protection fault: 0000 [#1] SMP
> [1412738.176036] CPU 3
> [1412738.176112] Modules linked in: binfmt_misc ipmi_msghandler
> ipt_ULOG x_tables autofs4 mptctl mptbase dlm configfs dm_crypt nfsd
> nfs lockd xfs auth_rpcgss n
> [1412738.177205]
> [1412738.177297] Pid: 10473, comm: 192.168.1.16-ma Not tainted 3.0.12
> #1 Dell C6100 /0D61XP
> [1412738.177683] RIP: 0010:[<ffffffffa02a8e00>] [<ffffffffa02a8e00>]
> nfs4_do_reclaim+0x1c0/0x560 [nfs]
> [1412738.178074] RSP: 0018:ffff88100e651e00 EFLAGS: 00010287
> [1412738.178296] RAX: 0000000000000042 RBX: ffff88080dff5380 RCX:
> 000000000003ffff
> [1412738.178606] RDX: ffff88080dff53a0 RSI: 0000000000000082 RDI:
> 0000000000000246
> [1412738.178917] RBP: ffff88100e651e80 R08: 0000000000000000 R09:
> 0000000000000000
> [1412738.179227] R10: 0000000000000006 R11: 0000000000000000 R12:
> ffffffffa02b9c00
> [1412738.179537] R13: dead000000100100 R14: ffff88100e762a58 R15:
> ffff88100e762a00
> [1412738.179848] FS: 0000000000000000(0000) GS:ffff88083fc60000(0000)
> knlGS:0000000000000000
> [1412738.180192] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [1412738.180428] CR2: 0000000001c89068 CR3: 000000100534f000 CR4:
> 00000000000006e0
> [1412738.180739] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [1412738.181049] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [1412738.181360] Process 192.168.1.16-ma (pid: 10473, threadinfo
> ffff88100e650000, task ffff8809a7ca8000)
> [1412738.181739] Stack:
> [1412738.181847] ffff88080dff53a0 ffff88080dff53c0 ffff8808055cf4b0
> ffff8808055cf400
> [1412738.182192] ffff88100e762a50 ffff88054ab0b2b0 ffff8808055cf4f8
> ffff88100e762a48
> [1412738.182538] ffffffffa02b9ec8 ffff880ac2296008 ffff88100e651e80
> ffff8808055cf4f0
> [1412738.182882] Call Trace:
> [1412738.183015] [<ffffffffa02a9424>] nfs4_run_state_manager+0x284/0x420 [nfs]
> [1412738.183298] [<ffffffffa02a91a0>] ? nfs4_do_reclaim+0x560/0x560 [nfs]
> [1412738.183562] [<ffffffff81080a96>] kthread+0x96/0xa0
> [1412738.183771] [<ffffffff815ac124>] kernel_thread_helper+0x4/0x10
> [1412738.184927] [<ffffffff81080a00>] ? kthread_worker_fn+0x190/0x190
> [1412738.185177] [<ffffffff815ac120>] ? gs_change+0x13/0x13
> [1412738.185395] Code: 48 74 50 4d 8b 6d 00 4d 85 ed 75 df e8 2a a5 ee
> e0 48 8b 7d a8 e8 41 cf dd e0 4c 8b 6b 20 48 8d 53 20 49 39 d5 74 18
> 0f 1f 40 00
> [1412738.186187] f6 45 18 01 0f 84 6a 03 00 00 4d 8b 6d 00 49 39 d5 75 ec 48
> [1412738.186646] RIP [<ffffffffa02a8e00>] nfs4_do_reclaim+0x1c0/0x560 [nfs]
> [1412738.186926] RSP <ffff88100e651e00>
> [1412738.187353] ---[ end trace 4dbb732d1756f6b1 ]---

3.0 kernels are no longer supported as part of the stable kernel series,
and are therefore missing a number of bugfixes. Please see if you can
reproduce this using a newer kernel.

Cheers
  Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

2012-03-07 20:58:12

by Boaz Harrosh

Subject: Re: new (to us) kernel panic nfsv4 linux 3.0.12

Hi Trond,

I recently sent a patch to @stable and got a response that it was added
to both 3.2 and 3.0.

So I think it's 3.1 that is not maintained; 3.0 and, of course, 3.2 are
still maintained.

Cheers
Boaz

On 03/07/2012 12:49 PM, Myklebust, Trond wrote:
> On Wed, 2012-03-07 at 14:41 -0500, Paul Anderson wrote:
>> The following kernel panic occurred on at least 4 compute nodes nearly
>> simultaneously. It was during unattended operation, so no clue as to
>> what the server was doing.
>>
>> The client node was under very heavy CPU load (12 core plus HT with
>> 50-100 jobs running). No swapping, unknown I/O but probably low,
>> except for the set of slurm jobs that stopped in D state probably due
>> to the kernel panic.
>>
>> uname -> Linux c09 3.0.12 #1 SMP Wed Nov 30 19:42:40 EST 2011 x86_64 GNU/Linux
>>
>> Please let me know what additional information I can provide - thanks!
>>
>> Paul Anderson
>> University of Michigan
>>
>> [1411404.724301] nfs4_reclaim_open_state: Lock reclaim failed!
>> [1412738.175791] nfs4_reclaim_open_state: Lock reclaim failed!
>> [1412738.175805] general protection fault: 0000 [#1] SMP
>> [1412738.176036] CPU 3
>> [1412738.176112] Modules linked in: binfmt_misc ipmi_msghandler
>> ipt_ULOG x_tables autofs4 mptctl mptbase dlm configfs dm_crypt nfsd
>> nfs lockd xfs auth_rpcgss n
>> [1412738.177205]
>> [1412738.177297] Pid: 10473, comm: 192.168.1.16-ma Not tainted 3.0.12
>> #1 Dell C6100 /0D61XP
>> [1412738.177683] RIP: 0010:[<ffffffffa02a8e00>] [<ffffffffa02a8e00>]
>> nfs4_do_reclaim+0x1c0/0x560 [nfs]
>> [1412738.178074] RSP: 0018:ffff88100e651e00 EFLAGS: 00010287
>> [1412738.178296] RAX: 0000000000000042 RBX: ffff88080dff5380 RCX:
>> 000000000003ffff
>> [1412738.178606] RDX: ffff88080dff53a0 RSI: 0000000000000082 RDI:
>> 0000000000000246
>> [1412738.178917] RBP: ffff88100e651e80 R08: 0000000000000000 R09:
>> 0000000000000000
>> [1412738.179227] R10: 0000000000000006 R11: 0000000000000000 R12:
>> ffffffffa02b9c00
>> [1412738.179537] R13: dead000000100100 R14: ffff88100e762a58 R15:
>> ffff88100e762a00
>> [1412738.179848] FS: 0000000000000000(0000) GS:ffff88083fc60000(0000)
>> knlGS:0000000000000000
>> [1412738.180192] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [1412738.180428] CR2: 0000000001c89068 CR3: 000000100534f000 CR4:
>> 00000000000006e0
>> [1412738.180739] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [1412738.181049] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
>> 0000000000000400
>> [1412738.181360] Process 192.168.1.16-ma (pid: 10473, threadinfo
>> ffff88100e650000, task ffff8809a7ca8000)
>> [1412738.181739] Stack:
>> [1412738.181847] ffff88080dff53a0 ffff88080dff53c0 ffff8808055cf4b0
>> ffff8808055cf400
>> [1412738.182192] ffff88100e762a50 ffff88054ab0b2b0 ffff8808055cf4f8
>> ffff88100e762a48
>> [1412738.182538] ffffffffa02b9ec8 ffff880ac2296008 ffff88100e651e80
>> ffff8808055cf4f0
>> [1412738.182882] Call Trace:
>> [1412738.183015] [<ffffffffa02a9424>] nfs4_run_state_manager+0x284/0x420 [nfs]
>> [1412738.183298] [<ffffffffa02a91a0>] ? nfs4_do_reclaim+0x560/0x560 [nfs]
>> [1412738.183562] [<ffffffff81080a96>] kthread+0x96/0xa0
>> [1412738.183771] [<ffffffff815ac124>] kernel_thread_helper+0x4/0x10
>> [1412738.184927] [<ffffffff81080a00>] ? kthread_worker_fn+0x190/0x190
>> [1412738.185177] [<ffffffff815ac120>] ? gs_change+0x13/0x13
>> [1412738.185395] Code: 48 74 50 4d 8b 6d 00 4d 85 ed 75 df e8 2a a5 ee
>> e0 48 8b 7d a8 e8 41 cf dd e0 4c 8b 6b 20 48 8d 53 20 49 39 d5 74 18
>> 0f 1f 40 00
>> [1412738.186187] f6 45 18 01 0f 84 6a 03 00 00 4d 8b 6d 00 49 39 d5 75 ec 48
>> [1412738.186646] RIP [<ffffffffa02a8e00>] nfs4_do_reclaim+0x1c0/0x560 [nfs]
>> [1412738.186926] RSP <ffff88100e651e00>
>> [1412738.187353] ---[ end trace 4dbb732d1756f6b1 ]---
>
> 3.0 kernels are no longer supported as part of the stable kernel series,
> and are therefore missing a number of bugfixes. Please see if you can
> reproduce this using a newer kernel.
>
> Cheers
> Trond


2012-03-07 21:11:29

by Myklebust, Trond

Subject: Re: new (to us) kernel panic nfsv4 linux 3.0.12

On Wed, 2012-03-07 at 15:53 -0500, Chuck Lever wrote:
> On Mar 7, 2012, at 3:49 PM, Myklebust, Trond wrote:
>
> > On Wed, 2012-03-07 at 14:41 -0500, Paul Anderson wrote:
> >> The following kernel panic occurred on at least 4 compute nodes nearly
> >> simultaneously. It was during unattended operation, so no clue as to
> >> what the server was doing.
> >>
> >> The client node was under very heavy CPU load (12 core plus HT with
> >> 50-100 jobs running). No swapping, unknown I/O but probably low,
> >> except for the set of slurm jobs that stopped in D state probably due
> >> to the kernel panic.
> >>
> >> uname -> Linux c09 3.0.12 #1 SMP Wed Nov 30 19:42:40 EST 2011 x86_64 GNU/Linux
> >>
> >> Please let me know what additional information I can provide - thanks!
> >>
> >> Paul Anderson
> >> University of Michigan
> >>
> >> [1411404.724301] nfs4_reclaim_open_state: Lock reclaim failed!
> >> [1412738.175791] nfs4_reclaim_open_state: Lock reclaim failed!
> >> [1412738.175805] general protection fault: 0000 [#1] SMP
> >> [1412738.176036] CPU 3
> >> [1412738.176112] Modules linked in: binfmt_misc ipmi_msghandler
> >> ipt_ULOG x_tables autofs4 mptctl mptbase dlm configfs dm_crypt nfsd
> >> nfs lockd xfs auth_rpcgss n
> >> [1412738.177205]
> >> [1412738.177297] Pid: 10473, comm: 192.168.1.16-ma Not tainted 3.0.12
> >> #1 Dell C6100 /0D61XP
> >> [1412738.177683] RIP: 0010:[<ffffffffa02a8e00>] [<ffffffffa02a8e00>]
> >> nfs4_do_reclaim+0x1c0/0x560 [nfs]
> >> [1412738.178074] RSP: 0018:ffff88100e651e00 EFLAGS: 00010287
> >> [1412738.178296] RAX: 0000000000000042 RBX: ffff88080dff5380 RCX:
> >> 000000000003ffff
> >> [1412738.178606] RDX: ffff88080dff53a0 RSI: 0000000000000082 RDI:
> >> 0000000000000246
> >> [1412738.178917] RBP: ffff88100e651e80 R08: 0000000000000000 R09:
> >> 0000000000000000
> >> [1412738.179227] R10: 0000000000000006 R11: 0000000000000000 R12:
> >> ffffffffa02b9c00
> >> [1412738.179537] R13: dead000000100100 R14: ffff88100e762a58 R15:
> >> ffff88100e762a00
> >> [1412738.179848] FS: 0000000000000000(0000) GS:ffff88083fc60000(0000)
> >> knlGS:0000000000000000
> >> [1412738.180192] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >> [1412738.180428] CR2: 0000000001c89068 CR3: 000000100534f000 CR4:
> >> 00000000000006e0
> >> [1412738.180739] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >> 0000000000000000
> >> [1412738.181049] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> >> 0000000000000400
> >> [1412738.181360] Process 192.168.1.16-ma (pid: 10473, threadinfo
> >> ffff88100e650000, task ffff8809a7ca8000)
> >> [1412738.181739] Stack:
> >> [1412738.181847] ffff88080dff53a0 ffff88080dff53c0 ffff8808055cf4b0
> >> ffff8808055cf400
> >> [1412738.182192] ffff88100e762a50 ffff88054ab0b2b0 ffff8808055cf4f8
> >> ffff88100e762a48
> >> [1412738.182538] ffffffffa02b9ec8 ffff880ac2296008 ffff88100e651e80
> >> ffff8808055cf4f0
> >> [1412738.182882] Call Trace:
> >> [1412738.183015] [<ffffffffa02a9424>] nfs4_run_state_manager+0x284/0x420 [nfs]
> >> [1412738.183298] [<ffffffffa02a91a0>] ? nfs4_do_reclaim+0x560/0x560 [nfs]
> >> [1412738.183562] [<ffffffff81080a96>] kthread+0x96/0xa0
> >> [1412738.183771] [<ffffffff815ac124>] kernel_thread_helper+0x4/0x10
> >> [1412738.184927] [<ffffffff81080a00>] ? kthread_worker_fn+0x190/0x190
> >> [1412738.185177] [<ffffffff815ac120>] ? gs_change+0x13/0x13
> >> [1412738.185395] Code: 48 74 50 4d 8b 6d 00 4d 85 ed 75 df e8 2a a5 ee
> >> e0 48 8b 7d a8 e8 41 cf dd e0 4c 8b 6b 20 48 8d 53 20 49 39 d5 74 18
> >> 0f 1f 40 00
> >> [1412738.186187] f6 45 18 01 0f 84 6a 03 00 00 4d 8b 6d 00 49 39 d5 75 ec 48
> >> [1412738.186646] RIP [<ffffffffa02a8e00>] nfs4_do_reclaim+0x1c0/0x560 [nfs]
> >> [1412738.186926] RSP <ffff88100e651e00>
> >> [1412738.187353] ---[ end trace 4dbb732d1756f6b1 ]---
> >
> > 3.0 kernels are no longer supported as part of the stable kernel series,
>
> I thought I just saw Greg KH post an e-mail calling for everyone to move to 3.0.

Oops.. You are right. I see that the bug I suspect is being hit above
was subject to a patch that didn't go through stable.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=4b44b40e04a758e2242ff4a3f7c15982801ec8bc

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
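
For readers who want to check whether the upstream commit named in this message (4b44b40e04a758e2242ff4a3f7c15982801ec8bc) later reached a stable series, one possible approach is sketched below in Python. It relies on the stable-tree convention that a backport's commit message cites the upstream hash, and searches a tag range with `git log --grep`. The helper names, the checkout path in the usage note, and the `v3.0..v3.0.12` range are assumptions chosen for illustration; an empty result only means no commit in that range mentions the hash.

```python
import subprocess

# Upstream commit hash taken from the link in the message above.
UPSTREAM = "4b44b40e04a758e2242ff4a3f7c15982801ec8bc"


def backport_query(upstream_sha, rev_range):
    """Build a git command listing commits in rev_range that mention upstream_sha."""
    return ["git", "log", "--oneline", "--grep=" + upstream_sha, rev_range]


def find_backports(repo_dir, upstream_sha, rev_range):
    """Run the query inside repo_dir and return one line per matching commit."""
    result = subprocess.run(
        backport_query(upstream_sha, rev_range),
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


# Show the command that would be run; the tag range is an assumed example.
print(" ".join(backport_query(UPSTREAM, "v3.0..v3.0.12")))
```

With a local clone of the stable tree (the path is hypothetical), `find_backports("/path/to/linux-stable", UPSTREAM, "v3.0..v3.0.12")` returns the matching one-line log entries, if any.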

2012-03-07 20:53:25

by Chuck Lever III

Subject: Re: new (to us) kernel panic nfsv4 linux 3.0.12


On Mar 7, 2012, at 3:49 PM, Myklebust, Trond wrote:

> On Wed, 2012-03-07 at 14:41 -0500, Paul Anderson wrote:
>> The following kernel panic occurred on at least 4 compute nodes nearly
>> simultaneously. It was during unattended operation, so no clue as to
>> what the server was doing.
>>
>> The client node was under very heavy CPU load (12 core plus HT with
>> 50-100 jobs running). No swapping, unknown I/O but probably low,
>> except for the set of slurm jobs that stopped in D state probably due
>> to the kernel panic.
>>
>> uname -> Linux c09 3.0.12 #1 SMP Wed Nov 30 19:42:40 EST 2011 x86_64 GNU/Linux
>>
>> Please let me know what additional information I can provide - thanks!
>>
>> Paul Anderson
>> University of Michigan
>>
>> [1411404.724301] nfs4_reclaim_open_state: Lock reclaim failed!
>> [1412738.175791] nfs4_reclaim_open_state: Lock reclaim failed!
>> [1412738.175805] general protection fault: 0000 [#1] SMP
>> [1412738.176036] CPU 3
>> [1412738.176112] Modules linked in: binfmt_misc ipmi_msghandler
>> ipt_ULOG x_tables autofs4 mptctl mptbase dlm configfs dm_crypt nfsd
>> nfs lockd xfs auth_rpcgss n
>> [1412738.177205]
>> [1412738.177297] Pid: 10473, comm: 192.168.1.16-ma Not tainted 3.0.12
>> #1 Dell C6100 /0D61XP
>> [1412738.177683] RIP: 0010:[<ffffffffa02a8e00>] [<ffffffffa02a8e00>]
>> nfs4_do_reclaim+0x1c0/0x560 [nfs]
>> [1412738.178074] RSP: 0018:ffff88100e651e00 EFLAGS: 00010287
>> [1412738.178296] RAX: 0000000000000042 RBX: ffff88080dff5380 RCX:
>> 000000000003ffff
>> [1412738.178606] RDX: ffff88080dff53a0 RSI: 0000000000000082 RDI:
>> 0000000000000246
>> [1412738.178917] RBP: ffff88100e651e80 R08: 0000000000000000 R09:
>> 0000000000000000
>> [1412738.179227] R10: 0000000000000006 R11: 0000000000000000 R12:
>> ffffffffa02b9c00
>> [1412738.179537] R13: dead000000100100 R14: ffff88100e762a58 R15:
>> ffff88100e762a00
>> [1412738.179848] FS: 0000000000000000(0000) GS:ffff88083fc60000(0000)
>> knlGS:0000000000000000
>> [1412738.180192] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [1412738.180428] CR2: 0000000001c89068 CR3: 000000100534f000 CR4:
>> 00000000000006e0
>> [1412738.180739] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [1412738.181049] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
>> 0000000000000400
>> [1412738.181360] Process 192.168.1.16-ma (pid: 10473, threadinfo
>> ffff88100e650000, task ffff8809a7ca8000)
>> [1412738.181739] Stack:
>> [1412738.181847] ffff88080dff53a0 ffff88080dff53c0 ffff8808055cf4b0
>> ffff8808055cf400
>> [1412738.182192] ffff88100e762a50 ffff88054ab0b2b0 ffff8808055cf4f8
>> ffff88100e762a48
>> [1412738.182538] ffffffffa02b9ec8 ffff880ac2296008 ffff88100e651e80
>> ffff8808055cf4f0
>> [1412738.182882] Call Trace:
>> [1412738.183015] [<ffffffffa02a9424>] nfs4_run_state_manager+0x284/0x420 [nfs]
>> [1412738.183298] [<ffffffffa02a91a0>] ? nfs4_do_reclaim+0x560/0x560 [nfs]
>> [1412738.183562] [<ffffffff81080a96>] kthread+0x96/0xa0
>> [1412738.183771] [<ffffffff815ac124>] kernel_thread_helper+0x4/0x10
>> [1412738.184927] [<ffffffff81080a00>] ? kthread_worker_fn+0x190/0x190
>> [1412738.185177] [<ffffffff815ac120>] ? gs_change+0x13/0x13
>> [1412738.185395] Code: 48 74 50 4d 8b 6d 00 4d 85 ed 75 df e8 2a a5 ee
>> e0 48 8b 7d a8 e8 41 cf dd e0 4c 8b 6b 20 48 8d 53 20 49 39 d5 74 18
>> 0f 1f 40 00
>> [1412738.186187] f6 45 18 01 0f 84 6a 03 00 00 4d 8b 6d 00 49 39 d5 75 ec 48
>> [1412738.186646] RIP [<ffffffffa02a8e00>] nfs4_do_reclaim+0x1c0/0x560 [nfs]
>> [1412738.186926] RSP <ffff88100e651e00>
>> [1412738.187353] ---[ end trace 4dbb732d1756f6b1 ]---
>
> 3.0 kernels are no longer supported as part of the stable kernel series,

I thought I just saw Greg KH post an e-mail calling for everyone to move to 3.0.

> and are therefore missing a number of bugfixes. Please see if you can
> reproduce this using a newer kernel.
>
> Cheers
> Trond
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> Trond.Myklebust@netapp.com
> http://www.netapp.com
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com