2006-09-08 07:31:37

by Andre Noll

Subject: 2.6.18-rc5 page_to_pfn: Unable to handle kernel NULL pointer dereference

The following just happened to one of our 8-way Opteron cluster nodes
(NFS client version 3, Solaris NFS server).

Just let me know if you need further info.
Andre

Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
[<ffffffff80150c09>] page_to_pfn+0x0/0x33
PGD 74ba36067 PUD 0
Oops: 0000 [1] SMP
CPU 0
Pid: 1782, comm: sge_execd Not tainted 2.6.18-rc5-tt64-6-gd9629953 #23
RIP: 0010:[<ffffffff80150c09>] [<ffffffff80150c09>] page_to_pfn+0x0/0x33
RSP: 0018:ffff81074b99bbb0 EFLAGS: 00010287
RAX: 0000000000000b9a RBX: 0000000000000734 RCX: 0000000000000b9a
RDX: ffff81074b99bbf0 RSI: ffff810812320500 RDI: 0000000000000000
RBP: 0000000000000466 R08: ffff8104c53a89a0 R09: ffff8104c53a8810
R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000008cc
R13: ffff81074b99bbf8 R14: 0000000000002000 R15: ffff8104c53a8ab8
FS: 00002b75f92987a0(0000) GS:ffffffff80703000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000080b603000 CR4: 00000000000006a0
Process sge_execd (pid: 1782, threadinfo ffff81074b99a000, task ffff81080bc50040)
Stack: ffffffff80205a72 ffff8104c53a89a0 ffff810812320380 0000000000000b9a
ffff8104c53a89a0 0000000000000466 ffffffff80205cf8 0000000000000000
ffff810817df9678 0000000000000000 ffff8104c53a89a0 ffff810817df9678
Call Trace:
[<ffffffff80205a72>] nfs_readpage_truncate_uninitialised_page+0x77/0xec
[<ffffffff80205cf8>] nfs_readpage_sync+0x211/0x255
[<ffffffff8020677e>] nfs_readpage+0x118/0x151
[<ffffffff8014c65e>] do_generic_mapping_read+0x1ec/0x398
[<ffffffff8014c80a>] file_read_actor+0x0/0xd1
[<ffffffff8014ca52>] __generic_file_aio_read+0x177/0x1b0
[<ffffffff8014cabf>] generic_file_aio_read+0x34/0x39
[<ffffffff801fea68>] nfs_file_read+0xb1/0xc0
[<ffffffff8016df5c>] do_sync_read+0xc9/0x106
[<ffffffff8013e541>] autoremove_wake_function+0x0/0x2e
[<ffffffff8015bd8e>] do_mmap_pgoff+0x5fd/0x6de
[<ffffffff8016e046>] vfs_read+0xad/0x14c
[<ffffffff8016e380>] sys_read+0x45/0x6e
[<ffffffff80109726>] system_call+0x7e/0x83


Code: 48 8b 07 48 c1 e8 3a 48 8b 14 c5 00 c8 70 80 48 b8 b7 6d db
RIP [<ffffffff80150c09>] page_to_pfn+0x0/0x33
RSP <ffff81074b99bbb0>
CR2: 0000000000000000

--
The only person who always got his work done by Friday was Robinson Crusoe



2006-09-15 12:41:12

by Andre Noll

Subject: Re: 2.6.18-rc5 page_to_pfn: Unable to handle kernel NULL pointer dereference

On 09:31, Andre Noll wrote:
> The following just happened to one of our 8-way Opteron cluster nodes
> (NFS client version 3, Solaris NFS server).

The problem is still present in 2.6.18-rc7. This time it happened on a
2-processor Opteron machine:

Pid: 945, comm: sge_execd Not tainted 2.6.18-rc7-tt64-6-g1883c5ab #4
RIP: 0010:[<ffffffff80150cad>] [<ffffffff80150cad>] page_to_pfn+0x0/0x33
RSP: 0018:ffff8100fac73bb0 EFLAGS: 00010283
RAX: 0000000000000a57 RBX: 00000000000004ae RCX: 0000000000000a57
RDX: ffff8100fac73bf0 RSI: ffff8101b9b1c540 RDI: 0000000000000000
RBP: 00000000000005a9 R08: ffff8101f7aa31d0 R09: ffff8101f7aa3040
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000b52
R13: ffff8100fac73bf8 R14: 0000000000002000 R15: ffff8101f7aa32e8
FS: 00002b83ded037a0(0000) GS:ffff8101000dc540(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000000fad78000 CR4: 00000000000006a0
Process sge_execd (pid: 945, threadinfo ffff8100fac72000, task ffff810004be5180)
Stack: ffffffff801d6276 ffff8101fc4f0440 ffff8101b9b1c3c0 0000000000000a57
ffff8101f7aa31d0 00000000000005a9 ffffffff801d64fc 0000000000000001
ffff8101ffec2e18 0000000000000000 ffff8101f7aa31d0 ffff8101ffec2e18
Call Trace:
[<ffffffff801d6276>] nfs_readpage_truncate_uninitialised_page+0x76/0xeb
[<ffffffff801d64fc>] nfs_readpage_sync+0x211/0x253
[<ffffffff801d6f89>] nfs_readpage+0x118/0x151
[<ffffffff8014c6fe>] do_generic_mapping_read+0x1ec/0x398
[<ffffffff8014c8aa>] file_read_actor+0x0/0xd1
[<ffffffff8014caf2>] __generic_file_aio_read+0x177/0x1b0
[<ffffffff8014cb5f>] generic_file_aio_read+0x34/0x39
[<ffffffff801cf3eb>] nfs_file_read+0x84/0x93
[<ffffffff8016e0fc>] do_sync_read+0xc9/0x106
[<ffffffff8013e5fd>] autoremove_wake_function+0x0/0x2e
[<ffffffff8015bf32>] do_mmap_pgoff+0x5fd/0x6de
[<ffffffff8016e1e6>] vfs_read+0xad/0x14c
[<ffffffff8016e520>] sys_read+0x45/0x6e
[<ffffffff80109726>] system_call+0x7e/0x83


Code: 48 8b 07 48 c1 e8 3a 48 8b 14 c5 c0 79 5f 80 48 b8 b7 6d db
RIP [<ffffffff80150cad>] page_to_pfn+0x0/0x33
RSP <ffff8100fac73bb0>
CR2: 0000000000000000

--
The only person who always got his work done by Friday was Robinson Crusoe



2006-09-18 08:08:15

by Andre Noll

Subject: Re: 2.6.18-rc5 page_to_pfn: Unable to handle kernel NULL pointer dereference

On 16:07, Trond Myklebust wrote:

> Does the attached patch fix it?

I applied the patch and rebooted; no problems so far. I'll let you know
if it oopses again.

Thanks
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe

