2008-06-17 16:04:30

by Ian

Subject: Oops in NFS (RHEL4, but also in kernel bugzilla)


I have a server that hosts some large XFS filesystems and serves them
out over NFS. Every so often I get the following Oops, and then the
machine locks hard with blinky keyboard lights. ("Every so often" == I
can't reproduce this reliably. It comes up about once a week, we've
seen it three times.)

Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
00000000
*pde = 355bf001
Oops: 0000 [#1]
SMP
Modules linked in: nfs nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc button battery ac ohci_hcd tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod aacraid aic7xxx sd_mod scsi_mod
CPU: 0
EIP: 0060:[<00000000>] Not tainted VLI
EFLAGS: 00010282 (2.6.9-67.0.15.ELirsmp)
EIP is at 0x0
eax: e1c86c30 ebx: c04ba260 ecx: 00000000 edx: d820304c
esi: d820304c edi: f6ecbf00 ebp: 00000000 esp: f6ecbee4
ds: 007b es: 007b ss: 0068
Process nfsd (pid: 4339, threadinfo=f6ecb000 task=f6c470b0)
Stack: c0168c5f e1c86c30 ffffffff f5f96090 60229cac cc751afc c0168cd3 60229cac
00000008 f5f96088 e1c86ca0 e1c86ca0 e1c86c30 cc751afc f5f95004 f8bcee28
f5f96088 f7e6ba00 f7d351c0 f7e6ba00 f8b2b46a f5f95800 f5f95000 f5f951d4
Call Trace:
[<c0168c5f>] __lookup_hash+0x70/0x89
[<c0168cd3>] lookup_one_len+0x54/0x63
[<f8bcee28>] nfsd_lookup+0x321/0x3ad [nfsd]
[<f8b2b46a>] svcauth_unix_set_client+0xa7/0xb5 [sunrpc]
[<f8bd6b49>] nfsd3_proc_lookup+0xa9/0xb3 [nfsd]
[<f8bd8b37>] nfs3svc_decode_diropargs+0x0/0xfa [nfsd]
[<f8bcc681>] nfsd_dispatch+0xba/0x16d [nfsd]
[<f8b2862d>] svc_process+0x444/0x6f3 [sunrpc]
[<f8bcc45a>] nfsd+0x1cc/0x339 [nfsd]
[<f8bcc28e>] nfsd+0x0/0x339 [nfsd]
[<c01041f5>] kernel_thread_helper+0x5/0xb
Code: Bad EIP value.
<0>Fatal exception: panic in 5 seconds

This machine is running RHEL4, using the stock kernel but with XFS
enabled. I would have reported it to Red Hat instead, but while googling
around I found a nearly identical kernel bugzilla report:

http://bugzilla.kernel.org/show_bug.cgi?id=7809

In there, the bug reporter has tracked the Oops to __lookup_hash() in
fs/namei.c, and includes a patch which basically just takes care to not
dereference inode->i_op->lookup without checking it first.
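
For readers following along, here is a minimal userspace sketch of the kind of guard described above. It is not the bugzilla patch itself, which is not reproduced in this thread. The struct definitions are simplified, hypothetical stand-ins for the kernel's, guarded_lookup() and demo_lookup() are made-up names, and returning -ENOTDIR is an assumption about what a sensible error might be. The point is only that calling through a NULL function pointer jumps to address 0, which is consistent with the "EIP is at 0x0" line in the Oops.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures.
 * The real definitions in the kernel headers are much larger. */
struct inode;

struct inode_operations {
    int (*lookup)(struct inode *dir, const char *name);
};

struct inode {
    const struct inode_operations *i_op;
};

/* Made-up lookup method: succeeds for any non-empty name. */
static int demo_lookup(struct inode *dir, const char *name)
{
    (void)dir;
    return (name != NULL && name[0] != '\0') ? 0 : -ENOENT;
}

/* A directory-like inode provides ->lookup; a file-like one does not. */
static const struct inode_operations dir_ops  = { .lookup = demo_lookup };
static const struct inode_operations file_ops = { .lookup = NULL };

/* Check every pointer on the path to ->lookup before calling it.
 * Without these checks, a NULL ->lookup would be called directly,
 * transferring control to address 0.
 * Assumption: -ENOTDIR is used here purely for illustration. */
static int guarded_lookup(struct inode *dir, const char *name)
{
    if (dir == NULL || dir->i_op == NULL || dir->i_op->lookup == NULL)
        return -ENOTDIR;
    return dir->i_op->lookup(dir, name);
}
```

With an inode whose i_op points at file_ops, guarded_lookup() returns an error instead of jumping through the NULL pointer. Whether such a guard fixes the root cause or only hides it is exactly the open question in this thread.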

I looked at the latest fs/namei.c via gitweb and it's the same code. So
I am reporting it here, where more knowledgeable and responsive
people lurk anyway.

Is this an NFS problem, or an XFS one? (Since XFS is common in both my
report and in the bugzilla one... I'm not sure whether the 'inode' in
question is NFS or from the underlying filesystem).

Is the bugzilla report's patch papering over a real problem, or does it
fix a real possible null-pointer case in __lookup_hash?

Thanks,
Ian


2008-06-19 02:33:18

by Daniel J Blueman

Subject: Re: Oops in NFS (RHEL4, but also in kernel bugzilla)

Hi Ian,

On 17 Jun, 17:10, Ian Soboroff <[email protected]> wrote:
> I have a server that hosts some large XFS filesystems and serves them
> out over NFS. Every so often I get the following Oops, and then the
> machine locks hard with blinky keyboard lights. ("Every so often" == I
> can't reproduce this reliably. It comes up about once a week, we've
> seen it three times.)
>
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> printing eip:
> 00000000
> *pde = 355bf001
> Oops: 0000 [#1]
> SMP
> Modules linked in: nfs nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc button battery ac ohci_hcd tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod aacraid aic7xxx sd_mod scsi_mod
> CPU: 0
> EIP: 0060:[<00000000>] Not tainted VLI
> EFLAGS: 00010282 (2.6.9-67.0.15.ELirsmp)
> EIP is at 0x0
> eax: e1c86c30 ebx: c04ba260 ecx: 00000000 edx: d820304c
> esi: d820304c edi: f6ecbf00 ebp: 00000000 esp: f6ecbee4
> ds: 007b es: 007b ss: 0068
> Process nfsd (pid: 4339, threadinfo=f6ecb000 task=f6c470b0)
> Stack: c0168c5f e1c86c30 ffffffff f5f96090 60229cac cc751afc c0168cd3 60229cac
> 00000008 f5f96088 e1c86ca0 e1c86ca0 e1c86c30 cc751afc f5f95004 f8bcee28
> f5f96088 f7e6ba00 f7d351c0 f7e6ba00 f8b2b46a f5f95800 f5f95000 f5f951d4
> Call Trace:
> [<c0168c5f>] __lookup_hash+0x70/0x89
> [<c0168cd3>] lookup_one_len+0x54/0x63
> [<f8bcee28>] nfsd_lookup+0x321/0x3ad [nfsd]
> [<f8b2b46a>] svcauth_unix_set_client+0xa7/0xb5 [sunrpc]
> [<f8bd6b49>] nfsd3_proc_lookup+0xa9/0xb3 [nfsd]
> [<f8bd8b37>] nfs3svc_decode_diropargs+0x0/0xfa [nfsd]
> [<f8bcc681>] nfsd_dispatch+0xba/0x16d [nfsd]
> [<f8b2862d>] svc_process+0x444/0x6f3 [sunrpc]
> [<f8bcc45a>] nfsd+0x1cc/0x339 [nfsd]
> [<f8bcc28e>] nfsd+0x0/0x339 [nfsd]
> [<c01041f5>] kernel_thread_helper+0x5/0xb
> Code: Bad EIP value.
> <0>Fatal exception: panic in 5 seconds

Have 4KB stacks been disabled? You can check the config file for CONFIG_4KSTACKS.

It may also be worth feeding that into the bugzilla entry, to
eliminate one possibility, as 'bad EIP value' looks suspiciously like
stack corruption.

Daniel

> This machine is running RHEL4, using the stock kernel but with XFS
> enabled. I would have reported it to Red Hat instead, but while googling
> around I found a nearly identical kernel bugzilla report:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=7809
>
> In there, the bug reporter has tracked the Oops to __lookup_hash() in
> fs/namei.c, and includes a patch which basically just takes care to not
> dereference inode->i_op->lookup without checking it first.
>
> I looked at the latest fs/namei.c via gitweb and it's the same code. So
> I am reporting it here, where more knowledgeable and responsive
> people lurk anyway.
>
> Is this an NFS problem, or an XFS one? (Since XFS is common in both my
> report and in the bugzilla one... I'm not sure whether the 'inode' in
> question is NFS or from the underlying filesystem).
>
> Is the bugzilla report's patch papering over a real problem, or does it
> fix a real possible null-pointer case in __lookup_hash?
>
> Thanks,
> Ian
--
Daniel J Blueman

2008-06-23 17:00:18

by Ian

Subject: Re: Oops in NFS (RHEL4, but also in kernel bugzilla)

"Daniel J Blueman" <[email protected]> writes:

> Have 4KB stacks been disabled? You can check the config file for
> CONFIG_4KSTACKS.

This kernel has 4KSTACKS enabled.

> It may also be worth feeding that into the bugzilla entry, to
> eliminate one possibility, as 'bad EIP value' looks suspiciously like
> stack corruption.

Ok, will do. Although that bugzilla entry is from 2007 and no one seems
to have looked at it at all...

Ian

2008-06-23 18:17:22

by Daniel J Blueman

Subject: Re: Oops in NFS (RHEL4, but also in kernel bugzilla)

On Mon, Jun 23, 2008 at 5:47 PM, Ian Soboroff <[email protected]> wrote:
> The following message is a courtesy copy of an article
> that has been posted to gmane.linux.kernel as well.
>
> "Daniel J Blueman" <[email protected]> writes:
>
>> Have 4KB stacks been disabled? You can check the config file for
>> CONFIG_4KSTACKS.
>
> This kernel has 4KSTACKS enabled.

There is a chance that you've overrun the 4KB stack. Can you retest with
CONFIG_4KSTACKS disabled, perhaps?

>> It may also be worth feeding that into the bugzilla entry, to
>> eliminate one possibility, as 'bad EIP value' looks suspiciously like
>> stack corruption.
>
> Ok, will do. Although that bugzilla entry is from 2007 and no one seems
> to have looked at it at all...
--
Daniel J Blueman

2008-06-23 18:30:30

by Ian

Subject: Re: Oops in NFS (RHEL4, but also in kernel bugzilla)

On Mon, Jun 23, 2008 at 2:17 PM, Daniel J Blueman
<[email protected]> wrote:

> There is a chance that you've overrun the 4KB stack. Can you retest with
> CONFIG_4KSTACKS disabled, perhaps?

Testing is hard as the oops is not easily reproducible, but I'll
prepare a non-4KSTACKS kernel so that I can boot to it if we oops
again.

I'm still interested to hear from someone if the patch in bugzilla is
good for catching a real error case, or if it's papering over a larger
problem (for example a stack overrun).

Ian