2008-08-02 12:10:28

by Paul Collins

Subject: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

I just got the oops below on a ppc32 NFS4 server. I was cross-compiling
Linux with an amd64 client at the time. The server is running Linus's
tree as of 94ad374a0751f40d25e22e036c37f7263569d24c, the client is
running 2.6.26.

The server's kernel was cross-compiled with gcc 4.2.4-3 and binutils
2.18.50.20080610-1, both built from the Debian sources following their
toolchain-building procedures.

Annoyingly, I can't kill two of the client processes:

1 11634 11634 977 ? -1 D 1000 0:00 make ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu- oldconfig vmlinux modules
1 23887 11634 977 ? -1 D 1000 0:00 [powerpc-linux-g]

Here's the oops. The instruction dump really was all Xes.

Unable to handle kernel paging request for instruction fetch
Faulting instruction address: 0x00000000
Oops: Kernel access of bad area, sig: 11 [#1]
PowerMac
Modules linked in: snd_aoa_codec_tas snd_aoa_fabric_layout snd_aoa snd_aoa_i2sbus snd_aoa_soundbus radeon drm b43 mac80211 cfg80211 pcmcia snd_pcm_oss snd_pcm snd_page_alloc yenta_socket rsrc_nonstatic pcmcia_core ssb uninorth_agp agpgart ehci_hcd ohci_hcd [last unloaded: snd_aoa_soundbus]
NIP: 00000000 LR: c0159a44 CTR: 00000000
REGS: c1d81c70 TRAP: 0400 Not tainted (2.6.27-rc1-00158-g643fbd8)
MSR: 40009032 <EE,ME,IR,DR> CR: 82002024 XER: 20000000
TASK = c1c7b210[2306] 'nfsd' THREAD: c1d80000
GPR00: c0159bcc c1d81d20 c1c7b210 82002044 c2e538d4 82002044 002e499d f92d835f
GPR08: 00000000 c2e55ac4 c2e55adc c1d81d50 82002024 00000000 018985fc 01898404
GPR16: 01898710 018c7894 c04b03f4 c04b03e0 c0173da0 c05f4e84 fffff000 00000001
GPR24: c00c05dc 00000000 82002044 00000000 c2e538d4 c2e538d4 c0437120 c1d81d20
NIP [00000000] 0x0
LR [c0159a44] find_acceptable_alias+0x44/0x108
Call Trace:
[c1d81d20] [c00cbab8] exportfs_d_alloc+0x40/0x70 (unreliable)
[c1d81d50] [c0159bcc] exportfs_decode_fh+0xc4/0x200
[c1d81e80] [c015d568] fh_verify+0x2e8/0x578
[c1d81ed0] [c016b1ec] nfsd4_putfh+0x60/0x78
[c1d81ef0] [c016afd0] nfsd4_proc_compound+0x1e4/0x34c
[c1d81f30] [c015a060] nfsd_dispatch+0xfc/0x220
[c1d81f50] [c0400c70] svc_process+0x3e4/0x6e8
[c1d81f90] [c015a8bc] nfsd+0x1c4/0x294
[c1d81fd0] [c0049e48] kthread+0x5c/0x9c
[c1d81ff0] [c00125c0] kernel_thread+0x44/0x60
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
---[ end trace 88de9451d0d3e759 ]---

--
Paul Collins
Wellington, New Zealand

Dag vijandelijk luchtschip de huismeester is dood


2008-08-02 18:46:13

by J. Bruce Fields

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

On Sun, Aug 03, 2008 at 12:03:18AM +1200, Paul Collins wrote:
> I just got the oops below on a ppc32 NFS4 server. I was cross-compiling
> Linux with an amd64 client at the time. The server is running Linus's
> tree as of 94ad374a0751f40d25e22e036c37f7263569d24c, the client is
> running 2.6.26.
>
> The server's kernel was cross-compiled with gcc 4.2.4-3 and binutils
> 2.18.50.20080610-1, both built from the Debian sources following their
> toolchain-building procedures.

Without having really thought about this,

496d6c32d4d057cb44272d9bd587ff97d023ee92 "nfsd: fix spurious
EACCESS in reconnect_path()"

is one suspect; it might be worth checking whether the problem's
reproducible with that reverted. But I assume we're not so lucky as to
have a 100% reproducible problem here?

What do your export options look like?

--b.

>
> Annoyingly, I can't kill two of the client processes:
>
> 1 11634 11634 977 ? -1 D 1000 0:00 make ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu- oldconfig vmlinux modules
> 1 23887 11634 977 ? -1 D 1000 0:00 [powerpc-linux-g]
>
> Here's the oops. The instruction dump really was all Xes.
>
> Unable to handle kernel paging request for instruction fetch
> Faulting instruction address: 0x00000000
> Oops: Kernel access of bad area, sig: 11 [#1]
> PowerMac
> Modules linked in: snd_aoa_codec_tas snd_aoa_fabric_layout snd_aoa snd_aoa_i2sbus snd_aoa_soundbus radeon drm b43 mac80211 cfg80211 pcmcia snd_pcm_oss snd_pcm snd_page_alloc yenta_socket rsrc_nonstatic pcmcia_core ssb uninorth_agp agpgart ehci_hcd ohci_hcd [last unloaded: snd_aoa_soundbus]
> NIP: 00000000 LR: c0159a44 CTR: 00000000
> REGS: c1d81c70 TRAP: 0400 Not tainted (2.6.27-rc1-00158-g643fbd8)
> MSR: 40009032 <EE,ME,IR,DR> CR: 82002024 XER: 20000000
> TASK = c1c7b210[2306] 'nfsd' THREAD: c1d80000
> GPR00: c0159bcc c1d81d20 c1c7b210 82002044 c2e538d4 82002044 002e499d f92d835f
> GPR08: 00000000 c2e55ac4 c2e55adc c1d81d50 82002024 00000000 018985fc 01898404
> GPR16: 01898710 018c7894 c04b03f4 c04b03e0 c0173da0 c05f4e84 fffff000 00000001
> GPR24: c00c05dc 00000000 82002044 00000000 c2e538d4 c2e538d4 c0437120 c1d81d20
> NIP [00000000] 0x0
> LR [c0159a44] find_acceptable_alias+0x44/0x108
> Call Trace:
> [c1d81d20] [c00cbab8] exportfs_d_alloc+0x40/0x70 (unreliable)
> [c1d81d50] [c0159bcc] exportfs_decode_fh+0xc4/0x200
> [c1d81e80] [c015d568] fh_verify+0x2e8/0x578
> [c1d81ed0] [c016b1ec] nfsd4_putfh+0x60/0x78
> [c1d81ef0] [c016afd0] nfsd4_proc_compound+0x1e4/0x34c
> [c1d81f30] [c015a060] nfsd_dispatch+0xfc/0x220
> [c1d81f50] [c0400c70] svc_process+0x3e4/0x6e8
> [c1d81f90] [c015a8bc] nfsd+0x1c4/0x294
> [c1d81fd0] [c0049e48] kthread+0x5c/0x9c
> [c1d81ff0] [c00125c0] kernel_thread+0x44/0x60
> Instruction dump:
> XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
> XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
> ---[ end trace 88de9451d0d3e759 ]---
>
> --
> Paul Collins
> Wellington, New Zealand
>
> Dag vijandelijk luchtschip de huismeester is dood
> _______________________________________________
> NFSv4 mailing list
> [email protected]
> http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4

2008-08-02 22:37:01

by Paul Collins

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

"J. Bruce Fields" <[email protected]> writes:

> On Sun, Aug 03, 2008 at 12:03:18AM +1200, Paul Collins wrote:
>> I just got the oops below on a ppc32 NFS4 server. I was cross-compiling
>> Linux with an amd64 client at the time. The server is running Linus's
>> tree as of 94ad374a0751f40d25e22e036c37f7263569d24c, the client is
>> running 2.6.26.
>>
>> The server's kernel was cross-compiled with gcc 4.2.4-3 and binutils
>> 2.18.50.20080610-1, both built from the Debian sources following their
>> toolchain-building procedures.
>
> Without having really thought about this,
>
> 496d6c32d4d057cb44272d9bd587ff97d023ee92 "nfsd: fix spurious
> EACCESS in reconnect_path()"
>
> is one suspect; it might be worth checking whether the problem's
> reproducible with that reverted. But I assume we're not so lucky as to
> have a 100% reproducible problem here?

Unknown. I've kicked off a fresh build. Here's hoping!

> What do your export options look like?

$ cat /etc/exports
/srv/nfsv4 *(sec=krb5:krb5i:krb5p,rw,fsid=0,crossmnt,insecure,no_subtree_check)
/srv/nfsv4/home/paul *(sec=krb5:krb5i:krb5p,rw,insecure,no_subtree_check)
$ mount | grep bind
/home/paul on /srv/nfsv4/home/paul type none (rw,bind)
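
For comparing the two export entries above side by side, a small parser can help. This is only a sketch: it assumes the simple `path client(options)` form used here and ignores anything more elaborate that /etc/exports allows. The name `parse_exports` is made up for illustration.

```python
import re

# The two lines shown in the /etc/exports listing above.
EXPORTS = """\
/srv/nfsv4 *(sec=krb5:krb5i:krb5p,rw,fsid=0,crossmnt,insecure,no_subtree_check)
/srv/nfsv4/home/paul *(sec=krb5:krb5i:krb5p,rw,insecure,no_subtree_check)
"""

def parse_exports(text):
    """Parse 'path client(opt,opt=value,...)' lines into dictionaries."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = re.fullmatch(r"(\S+)\s+([^\s(]+)\(([^)]*)\)", line)
        if m is None:
            continue  # this sketch skips any line not in the simple form
        path, client, raw = m.groups()
        options = {}
        for opt in raw.split(","):
            key, _, value = opt.partition("=")
            options[key] = value if value else True  # bare flags become True
        entries.append({"path": path, "client": client, "options": options})
    return entries

entries = parse_exports(EXPORTS)
root, home = entries
# Options present on the root export but not on the bind-mounted home export.
only_on_root = sorted(set(root["options"]) - set(home["options"]))
print(only_on_root)
```

With the two entries above, the only differences are that the root export carries `fsid=0` and `crossmnt`.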

--
Paul Collins
Wellington, New Zealand

Dag vijandelijk luchtschip de huismeester is dood

2008-08-03 06:47:51

by Paul Collins

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

Paul Collins <[email protected]> writes:

> "J. Bruce Fields" <[email protected]> writes:
>
>> On Sun, Aug 03, 2008 at 12:03:18AM +1200, Paul Collins wrote:
>>> I just got the oops below on a ppc32 NFS4 server. I was cross-compiling
>>> Linux with an amd64 client at the time. The server is running Linus's
>>> tree as of 94ad374a0751f40d25e22e036c37f7263569d24c, the client is
>>> running 2.6.26.
>>>
>>> The server's kernel was cross-compiled with gcc 4.2.4-3 and binutils
>>> 2.18.50.20080610-1, both built from the Debian sources following their
>>> toolchain-building procedures.
>>
>> Without having really thought about this,
>>
>> 496d6c32d4d057cb44272d9bd587ff97d023ee92 "nfsd: fix spurious
>> EACCESS in reconnect_path()"
>>
>> is one suspect; it might be worth checking whether the problem's
>> reproducible with that reverted. But I assume we're not so lucky as to
>> have a 100% reproducible problem here?
>
> Unknown. I've kicked off a fresh build. Here's hoping!

I can trigger it reliably with a 2.6.26 client. I've also triggered it
with 496d6c32d4d057cb44272d9bd587ff97d023ee92 reverted on the server.

It's harder to trigger with 2.6.27-rc1+ but I managed to get an Oops
on the fourth build after three successful builds on the NFS4 mount.

One of the Oopses I got with 2.6.26 had a slightly different call trace:

Unable to handle kernel paging request for instruction fetch
Faulting instruction address: 0x00000000
Oops: Kernel access of bad area, sig: 11 [#1]
PowerMac
Modules linked in: radeon drm snd_aoa_codec_tas snd_aoa_fabric_layout b43 mac80211 snd_aoa cfg80211 pcmcia snd_aoa_i2sbus snd_pcm_oss snd_pcm snd_page_alloc snd_aoa_soundbus yenta_socket rsrc_nonstatic pcmcia_core ssb uninorth_agp agpgart ehci_hcd ohci_hcd
NIP: 00000000 LR: c0159bb0 CTR: 00000000
REGS: c1f79ca0 TRAP: 0400 Not tainted (2.6.27-rc1-00158-g643fbd8)
MSR: 40009032 <EE,ME,IR,DR> CR: 22002022 XER: 20000000
TASK = c1ca5440[2321] 'nfsd' THREAD: c1f78000
GPR00: 00000008 c1f79d50 c1ca5440 82002044 edda2c54 002d6642 002d6642 e05f26bc
GPR08: ee126e60 00000000 eed7a600 c1f79d50 00000007 00000000 018985fc 01898404
GPR16: 01898710 018c7894 c04b03f4 c04b03e0 c0173da0 c05f4e84 fffff000 00000001
GPR24: c00c05dc 00000000 82002044 ef80cca0 edda2c54 ee10d020 c0437120 c1f79d50
NIP [00000000] 0x0
LR [c0159bb0] exportfs_decode_fh+0xa8/0x200
Call Trace:
[c1f79d50] [c0159b54] exportfs_decode_fh+0x4c/0x200 (unreliable)
[c1f79e80] [c015d568] fh_verify+0x2e8/0x578
[c1f79ed0] [c016b1ec] nfsd4_putfh+0x60/0x78
[c1f79ef0] [c016afd0] nfsd4_proc_compound+0x1e4/0x34c
[c1f79f30] [c015a060] nfsd_dispatch+0xfc/0x220
[c1f79f50] [c0400c70] svc_process+0x3e4/0x6e8
[c1f79f90] [c015a8bc] nfsd+0x1c4/0x294
[c1f79fd0] [c0049e48] kthread+0x5c/0x9c
[c1f79ff0] [c00125c0] kernel_thread+0x44/0x60
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
---[ end trace 3dfa6e448b5c7077 ]---

--
Paul Collins
Wellington, New Zealand

Dag vijandelijk luchtschip de huismeester is dood

2008-08-03 12:09:54

by NeilBrown

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

On Sunday August 3, [email protected] wrote:
>
> I can trigger it reliably with a 2.6.26 client. I've also triggered it
> with 496d6c32d4d057cb44272d9bd587ff97d023ee92 reverted on the server.
>
> It's harder to trigger with 2.6.27-rc1+ but I managed to get an Oops
> on the fourth build after three successful builds on the NFS4 mount.
>
> One of the Oopses I got with 2.6.26 had a slightly different call trace:
>
> Unable to handle kernel paging request for instruction fetch
> Faulting instruction address: 0x00000000

So we have called a function pointer which was NULL.

There are lots of function pointers in use in this code.
There is the 'acceptable' function. There is ->fh_to_dentry
and ->fh_to_parent. And various inode operations like ->lookup, but
that is a bit further away.
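
The inference above (instruction fetch fault at address 0, so a NULL function pointer was called) can be sketched as a small check on the oops header. This is an illustration only; `parse_oops` and `looks_like_null_call` are hypothetical helper names, and the regular expressions assume the ppc32 oops layout seen in this thread.

```python
import re

# Header lines copied from the second oops quoted in this thread.
OOPS = """\
Unable to handle kernel paging request for instruction fetch
Faulting instruction address: 0x00000000
NIP: 00000000 LR: c0159bb0 CTR: 00000000
REGS: c1f79ca0 TRAP: 0400 Not tainted (2.6.27-rc1-00158-g643fbd8)
LR [c0159bb0] exportfs_decode_fh+0xa8/0x200
"""

def parse_oops(text):
    """Return NIP, LR and (if present) the symbol the kernel printed for LR."""
    regs = re.search(r"NIP: ([0-9a-f]{8}) LR: ([0-9a-f]{8})", text)
    sym = re.search(r"^LR \[[0-9a-f]{8}\] (\S+)", text, re.M)
    return {
        "nip": int(regs.group(1), 16),
        "lr": int(regs.group(2), 16),
        "lr_symbol": sym.group(1) if sym else None,
    }

def looks_like_null_call(info):
    # NIP == 0 means the CPU tried to fetch an instruction from address 0,
    # which is what a branch through a NULL function pointer produces.
    return info["nip"] == 0

info = parse_oops(OOPS)
print(hex(info["lr"]), info["lr_symbol"], looks_like_null_call(info))
```

LR holds the return address of the most recent branch-and-link, so the symbol printed for LR points at the caller rather than at the (nonexistent) callee.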

> NIP [00000000] 0x0
> LR [c0159bb0] exportfs_decode_fh+0xa8/0x200

I guess this is where the call came from.
exportfs_decode_fh is never passed NULL for 'acceptable'. Only
ever 'nfsd_acceptable'.
->fh_to_parent is tested for NULL before being called, and
->fh_to_dentry is called very early in exportfs_decode_fh, whereas
the bad call is 0xa8 into the function.
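
The `symbol+offset/size` notation in the traces can be turned back into function start addresses with plain subtraction. The sketch below uses only numbers printed in the two oopses in this thread; `symbol_base` is a name invented for the example.

```python
def symbol_base(addr, offset):
    """Start address of a function, given an address inside it and its offset."""
    return addr - offset

# "LR [c0159a44] find_acceptable_alias+0x44/0x108" (first oops)
faa_base = symbol_base(0xc0159a44, 0x44)
faa_end = faa_base + 0x108  # the number after the slash is the function size

# "LR [c0159bb0] exportfs_decode_fh+0xa8/0x200" (second oops)
edf_base = symbol_base(0xc0159bb0, 0xa8)

# "[c0159bcc] exportfs_decode_fh+0xc4/0x200" (first oops, call trace)
edf_base_from_trace = symbol_base(0xc0159bcc, 0xc4)

# PowerPC instructions are 4 bytes wide, so the call itself sits at LR - 4.
call_site_offset = 0xa8 - 4

print(hex(faa_base), hex(faa_end), hex(edf_base), hex(call_site_offset))
```

Both traces agree on where exportfs_decode_fh starts, and find_acceptable_alias ends exactly there, so the two oopses appear to come from the same kernel image.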

Is it possible that ->fh_to_parent is being changed immediately after
being tested for NULL and before being dereferenced? That seems
unlikely.

What filesystem is being exported here?

Can you get an assembly version of exportfs_decode_fh, so we can check
what is happening at 0xa8 (and 0x4c)?
Either "disassemble exportfs_decode_fh" in gdb, or
make fs/exportfs/expfs.s
(I think).

NeilBrown


> Call Trace:
> [c1f79d50] [c0159b54] exportfs_decode_fh+0x4c/0x200 (unreliable)
> [c1f79e80] [c015d568] fh_verify+0x2e8/0x578
> [c1f79ed0] [c016b1ec] nfsd4_putfh+0x60/0x78
> [c1f79ef0] [c016afd0] nfsd4_proc_compound+0x1e4/0x34c
> [c1f79f30] [c015a060] nfsd_dispatch+0xfc/0x220
> [c1f79f50] [c0400c70] svc_process+0x3e4/0x6e8
> [c1f79f90] [c015a8bc] nfsd+0x1c4/0x294
> [c1f79fd0] [c0049e48] kthread+0x5c/0x9c
> [c1f79ff0] [c00125c0] kernel_thread+0x44/0x60

2008-08-03 12:26:01

by Paul Collins

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

Neil Brown <[email protected]> writes:

> On Sunday August 3, [email protected] wrote:
>>
>> I can trigger it reliably with a 2.6.26 client. I've also triggered it
>> with 496d6c32d4d057cb44272d9bd587ff97d023ee92 reverted on the server.
>>
>> It's harder to trigger with 2.6.27-rc1+ but I managed to get an Oops
>> on the fourth build after three successful builds on the NFS4 mount.
>>
>> One of the Oopses I got with 2.6.26 had a slightly different call trace:
>>
>> Unable to handle kernel paging request for instruction fetch
>> Faulting instruction address: 0x00000000
>
> So we have called a function pointer which was NULL.
>
> There are lots of function pointers in use in this code.
> There is the 'acceptable' function. There is ->fh_to_dentry
> and ->fh_to_parent. And various inode operations like ->lookup, but
> that is a bit further away.
>
>> NIP [00000000] 0x0
>> LR [c0159bb0] exportfs_decode_fh+0xa8/0x200
>
> I guess this is where the call came from.
> exportfs_decode_fh is never passed NULL for 'acceptable'. Only
> ever 'nfsd_acceptable'.
> ->fh_to_parent is tested for NULL before being called, and
> ->fh_to_dentry is called very early in exportfs_decode_fh, whereas
> the bad call is 0xa8 into the function.
>
> Is it possible that ->fh_to_parent is being changed immediately after
> being tested for NULL and before being dereferenced? That seems
> unlikely.
>
> What filesystem is being exported here?

Boring old ext3 (on LVM, on dm-crypt).

> Can you get an assembly version of exportfs_decode_fh, so we can check
> what is happening at 0xa8 (and 0x4c).

Dump of assembler code for function exportfs_decode_fh:
0xc015b7cc <exportfs_decode_fh+0>: mflr r0
0xc015b7d0 <exportfs_decode_fh+4>: stw r0,4(r1)
0xc015b7d4 <exportfs_decode_fh+8>: bl 0xc0013154 <_mcount>
0xc015b7d8 <exportfs_decode_fh+12>: stwu r1,-304(r1)
0xc015b7dc <exportfs_decode_fh+16>: mflr r0
0xc015b7e0 <exportfs_decode_fh+20>: stmw r22,264(r1)
0xc015b7e4 <exportfs_decode_fh+24>: mr r27,r3
0xc015b7e8 <exportfs_decode_fh+28>: mr r31,r1
0xc015b7ec <exportfs_decode_fh+32>: stw r0,308(r1)
0xc015b7f0 <exportfs_decode_fh+36>: mr r25,r7
0xc015b7f4 <exportfs_decode_fh+40>: mr r26,r8
0xc015b7f8 <exportfs_decode_fh+44>: mr r29,r4
0xc015b7fc <exportfs_decode_fh+48>: mr r24,r5
0xc015b800 <exportfs_decode_fh+52>: mr r23,r6
0xc015b804 <exportfs_decode_fh+56>: lwz r3,20(r3)
0xc015b808 <exportfs_decode_fh+60>: lwz r30,48(r3)
0xc015b80c <exportfs_decode_fh+64>: lwz r0,4(r30)
0xc015b810 <exportfs_decode_fh+68>: mtctr r0
0xc015b814 <exportfs_decode_fh+72>: bctrl
0xc015b818 <exportfs_decode_fh+76>: mr. r28,r3
0xc015b81c <exportfs_decode_fh+80>: bne+ 0xc015b824 <exportfs_decode_fh+88>
0xc015b820 <exportfs_decode_fh+84>: li r28,-116
0xc015b824 <exportfs_decode_fh+88>: li r22,-4096
0xc015b828 <exportfs_decode_fh+92>: cmplw cr7,r28,r22
0xc015b82c <exportfs_decode_fh+96>: bgt- cr7,0xc015b9b0 <exportfs_decode_fh+484>
0xc015b830 <exportfs_decode_fh+100>: lwz r9,8(r28)
0xc015b834 <exportfs_decode_fh+104>: lhz r0,114(r9)
0xc015b838 <exportfs_decode_fh+108>: rlwinm r0,r0,0,16,19
0xc015b83c <exportfs_decode_fh+112>: cmpwi cr7,r0,16384
0xc015b840 <exportfs_decode_fh+116>: bne- cr7,0xc015b880 <exportfs_decode_fh+180>
0xc015b844 <exportfs_decode_fh+120>: lwz r0,4(r28)
0xc015b848 <exportfs_decode_fh+124>: andi. r9,r0,4
0xc015b84c <exportfs_decode_fh+128>: beq- 0xc015b864 <exportfs_decode_fh+152>
0xc015b850 <exportfs_decode_fh+132>: mr r3,r27
0xc015b854 <exportfs_decode_fh+136>: mr r4,r28
0xc015b858 <exportfs_decode_fh+140>: bl 0xc015b45c <reconnect_path>
0xc015b85c <exportfs_decode_fh+144>: mr. r30,r3
0xc015b860 <exportfs_decode_fh+148>: bne- 0xc015b9a4 <exportfs_decode_fh+472>
0xc015b864 <exportfs_decode_fh+152>: mr r3,r26
0xc015b868 <exportfs_decode_fh+156>: mr r4,r28
0xc015b86c <exportfs_decode_fh+160>: mtctr r25
0xc015b870 <exportfs_decode_fh+164>: bctrl
0xc015b874 <exportfs_decode_fh+168>: cmpwi cr7,r3,0
0xc015b878 <exportfs_decode_fh+172>: beq+ cr7,0xc015b998 <exportfs_decode_fh+460>
0xc015b87c <exportfs_decode_fh+176>: b 0xc015b9b0 <exportfs_decode_fh+484>
0xc015b880 <exportfs_decode_fh+180>: mr r3,r28
0xc015b884 <exportfs_decode_fh+184>: mr r4,r25
0xc015b888 <exportfs_decode_fh+188>: mr r5,r26
0xc015b88c <exportfs_decode_fh+192>: bl 0xc015b6c4 <find_acceptable_alias>
0xc015b890 <exportfs_decode_fh+196>: cmpwi r3,0
0xc015b894 <exportfs_decode_fh+200>: bne+ 0xc015b990 <exportfs_decode_fh+452>
0xc015b898 <exportfs_decode_fh+204>: lwz r0,8(r30)
0xc015b89c <exportfs_decode_fh+208>: cmpwi cr7,r0,0
0xc015b8a0 <exportfs_decode_fh+212>: beq- cr7,0xc015b9a0 <exportfs_decode_fh+468>
0xc015b8a4 <exportfs_decode_fh+216>: mr r4,r29
0xc015b8a8 <exportfs_decode_fh+220>: mr r5,r24
0xc015b8ac <exportfs_decode_fh+224>: lwz r3,20(r27)
0xc015b8b0 <exportfs_decode_fh+228>: mtctr r0
0xc015b8b4 <exportfs_decode_fh+232>: mr r6,r23
0xc015b8b8 <exportfs_decode_fh+236>: bctrl
0xc015b8bc <exportfs_decode_fh+240>: mr. r29,r3
0xc015b8c0 <exportfs_decode_fh+244>: beq- 0xc015b9a0 <exportfs_decode_fh+468>
0xc015b8c4 <exportfs_decode_fh+248>: cmplw cr7,r29,r22
0xc015b8c8 <exportfs_decode_fh+252>: mr r30,r29
0xc015b8cc <exportfs_decode_fh+256>: bgt- cr7,0xc015b9a4 <exportfs_decode_fh+472>
0xc015b8d0 <exportfs_decode_fh+260>: mr r3,r27
0xc015b8d4 <exportfs_decode_fh+264>: mr r4,r29
0xc015b8d8 <exportfs_decode_fh+268>: bl 0xc015b45c <reconnect_path>
0xc015b8dc <exportfs_decode_fh+272>: mr. r30,r3
0xc015b8e0 <exportfs_decode_fh+276>: beq- 0xc015b8f0 <exportfs_decode_fh+292>
0xc015b8e4 <exportfs_decode_fh+280>: mr r3,r29
0xc015b8e8 <exportfs_decode_fh+284>: bl 0xc00befb0 <dput>
0xc015b8ec <exportfs_decode_fh+288>: b 0xc015b9a4 <exportfs_decode_fh+472>
0xc015b8f0 <exportfs_decode_fh+292>: addi r30,r31,8
0xc015b8f4 <exportfs_decode_fh+296>: mr r3,r27
0xc015b8f8 <exportfs_decode_fh+300>: mr r4,r29
0xc015b8fc <exportfs_decode_fh+304>: mr r5,r30
0xc015b900 <exportfs_decode_fh+308>: mr r6,r28
0xc015b904 <exportfs_decode_fh+312>: bl 0xc015b2cc <exportfs_get_name>
0xc015b908 <exportfs_decode_fh+316>: cmpwi cr7,r3,0
0xc015b90c <exportfs_decode_fh+320>: bne+ cr7,0xc015b970 <exportfs_decode_fh+420>
0xc015b910 <exportfs_decode_fh+324>: lwz r3,8(r29)
0xc015b914 <exportfs_decode_fh+328>: addi r3,r3,116
0xc015b918 <exportfs_decode_fh+332>: bl 0xc0421bb0 <mutex_lock>
0xc015b91c <exportfs_decode_fh+336>: mr r3,r30
0xc015b920 <exportfs_decode_fh+340>: bl 0xc00188fc <strlen>
0xc015b924 <exportfs_decode_fh+344>: mr r4,r29
0xc015b928 <exportfs_decode_fh+348>: mr r5,r3
0xc015b92c <exportfs_decode_fh+352>: mr r3,r30
0xc015b930 <exportfs_decode_fh+356>: bl 0xc00b4e44 <lookup_one_len>
0xc015b934 <exportfs_decode_fh+360>: mr r30,r3
0xc015b938 <exportfs_decode_fh+364>: lwz r3,8(r29)
0xc015b93c <exportfs_decode_fh+368>: addi r3,r3,116
0xc015b940 <exportfs_decode_fh+372>: bl 0xc04219a8 <mutex_unlock>
0xc015b944 <exportfs_decode_fh+376>: cmplw cr7,r30,r22
0xc015b948 <exportfs_decode_fh+380>: bgt- cr7,0xc015b970 <exportfs_decode_fh+420>
0xc015b94c <exportfs_decode_fh+384>: lwz r0,8(r30)
0xc015b950 <exportfs_decode_fh+388>: cmpwi cr7,r0,0
0xc015b954 <exportfs_decode_fh+392>: beq- cr7,0xc015b968 <exportfs_decode_fh+412>
0xc015b958 <exportfs_decode_fh+396>: mr r3,r28
0xc015b95c <exportfs_decode_fh+400>: mr r28,r30
0xc015b960 <exportfs_decode_fh+404>: bl 0xc00befb0 <dput>
0xc015b964 <exportfs_decode_fh+408>: b 0xc015b970 <exportfs_decode_fh+420>
0xc015b968 <exportfs_decode_fh+412>: mr r3,r30
0xc015b96c <exportfs_decode_fh+416>: bl 0xc00befb0 <dput>
0xc015b970 <exportfs_decode_fh+420>: mr r3,r29
0xc015b974 <exportfs_decode_fh+424>: bl 0xc00befb0 <dput>
0xc015b978 <exportfs_decode_fh+428>: mr r3,r28
0xc015b97c <exportfs_decode_fh+432>: mr r4,r25
0xc015b980 <exportfs_decode_fh+436>: mr r5,r26
0xc015b984 <exportfs_decode_fh+440>: bl 0xc015b6c4 <find_acceptable_alias>
0xc015b988 <exportfs_decode_fh+444>: cmpwi r3,0
0xc015b98c <exportfs_decode_fh+448>: beq- 0xc015b998 <exportfs_decode_fh+460>
0xc015b990 <exportfs_decode_fh+452>: mr r28,r3
0xc015b994 <exportfs_decode_fh+456>: b 0xc015b9b0 <exportfs_decode_fh+484>
0xc015b998 <exportfs_decode_fh+460>: li r30,-13
0xc015b99c <exportfs_decode_fh+464>: b 0xc015b9a4 <exportfs_decode_fh+472>
0xc015b9a0 <exportfs_decode_fh+468>: li r30,-116
0xc015b9a4 <exportfs_decode_fh+472>: mr r3,r28
0xc015b9a8 <exportfs_decode_fh+476>: mr r28,r30
0xc015b9ac <exportfs_decode_fh+480>: bl 0xc00befb0 <dput>
0xc015b9b0 <exportfs_decode_fh+484>: lwz r11,0(r1)
0xc015b9b4 <exportfs_decode_fh+488>: mr r3,r28
0xc015b9b8 <exportfs_decode_fh+492>: lwz r0,4(r11)
0xc015b9bc <exportfs_decode_fh+496>: lmw r22,-40(r11)
0xc015b9c0 <exportfs_decode_fh+500>: mr r1,r11
0xc015b9c4 <exportfs_decode_fh+504>: mtlr r0
0xc015b9c8 <exportfs_decode_fh+508>: blr
End of assembler dump.
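
A few immediates in the dump above look like familiar kernel idioms. The reading below is an interpretation, not something stated in the thread: -116 and -13 are taken to be -ESTALE and -EACCES, the unsigned compare against -4096 is taken to be an IS_ERR()-style check, and the `rlwinm`/`cmpwi 16384` pair is taken to be an S_ISDIR()-style test. A sketch that mirrors those operations on 32-bit values:

```python
import stat

MASK32 = 0xFFFFFFFF
MAX_ERRNO = 4095  # assumed to match the kernel's include/linux/err.h

# Linux errno numbers, hard-coded on purpose: other platforms number them differently.
ESTALE, EACCES = 116, 13

def u32(value):
    """Reinterpret a Python int as a 32-bit unsigned register value."""
    return value & MASK32

def is_err(reg):
    """Mirror of 'li r22,-4096; cmplw reg,r22; bgt': unsigned reg > 0xfffff000."""
    return u32(reg) > u32(-4096)

def is_err_value(reg):
    """The same test written the way the kernel macro is usually described."""
    return u32(reg) >= u32(-MAX_ERRNO)

def is_dir(i_mode):
    """Mirror of 'rlwinm r0,r0,0,16,19; cmpwi r0,16384': (mode & 0xf000) == 0x4000."""
    return (i_mode & 0xF000) == 16384

print(is_err(-ESTALE), is_err(0xc1d81d20), is_dir(0o040755), is_dir(0o100644))
```

Under that reading, the branch at +96 skips straight to the function epilogue when ->fh_to_dentry returned an error pointer, and the branch at +116 separates the directory case from the non-directory case.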

--
Paul Collins
Wellington, New Zealand

Dag vijandelijk luchtschip de huismeester is dood

2008-08-04 04:08:26

by NeilBrown

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

On Monday August 4, [email protected] wrote:
> Neil Brown <[email protected]> writes:
> >
> > What filesystem is being exported here?
>
> Boring old ext3 (on LVM, on dm-crypt).


Good. That makes it easier.
>
> > Can you get an assembly version of exportfs_decode_fh, so we can check
> > what is happening at 0xa8 (and 0x4c).
>

Thanks.

bctrl appears to be the indirect-function-call opcode. There are
three of them, one each for
->fh_to_dentry
acceptable
->fh_to_parent

0xa8 is 'acceptable'.
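
The mapping above can be checked mechanically. The sketch below scans an excerpt of the posted gdb output for `bctrl` and reports the offset LR would hold after each one (call offset plus 4). The excerpt keeps only the lines around each indirect call; the helper name is invented for the example.

```python
import re

# Excerpt of the gdb output posted earlier in the thread: each indirect call
# (bctrl) together with the instructions that set up CTR before it.
DUMP = """\
0xc015b80c <exportfs_decode_fh+64>: lwz r0,4(r30)
0xc015b810 <exportfs_decode_fh+68>: mtctr r0
0xc015b814 <exportfs_decode_fh+72>: bctrl
0xc015b86c <exportfs_decode_fh+160>: mtctr r25
0xc015b870 <exportfs_decode_fh+164>: bctrl
0xc015b898 <exportfs_decode_fh+204>: lwz r0,8(r30)
0xc015b8b0 <exportfs_decode_fh+228>: mtctr r0
0xc015b8b8 <exportfs_decode_fh+236>: bctrl
"""

LINE = re.compile(r"<\w+\+(\d+)>:\s+(\S+)")

def indirect_call_return_offsets(dump):
    """Offsets that LR would hold after each bctrl (call offset + 4)."""
    offsets = []
    for line in dump.splitlines():
        m = LINE.search(line)
        if m and m.group(2) == "bctrl":
            offsets.append(int(m.group(1)) + 4)
    return offsets

returns = indirect_call_return_offsets(DUMP)
print([hex(r) for r in returns])  # ['0x4c', '0xa8', '0xf0']
```

The first and third calls load CTR from offsets 4 and 8 of the same structure, while the second takes CTR directly from r25, which fits the second one being the 'acceptable' argument rather than a member of the export operations.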

In the first traceback, the crash was a call from very early in
find_acceptable_alias. The first significant thing it does is call
the 'acceptable' function.

So it seems clear that 'acceptable' is NULL.
It is equally clear that we never ever set it to NULL in the code.
The logical conclusion is "compiler error".
We can confirm (hopefully) by looking at a disassembly of fh_verify.
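
One way to cross-check the conclusion is to follow the 'acceptable' argument through the registers. The sketch below rests on two assumptions that are not confirmed in this thread: that the parameter order matches the 2.6.27-era prototype of exportfs_decode_fh(), and that the oopsing kernel allocated registers the same way as the build that was disassembled (the addresses differ, although the function size and call offsets match).

```python
# Parameter order assumed from the 2.6.27-era prototype of exportfs_decode_fh();
# check it against the tree actually being built.
PARAMS = ["mnt", "fid", "fh_len", "fileid_type", "acceptable", "context"]

# 32-bit PowerPC SysV ABI: the first eight integer/pointer arguments go in r3..r10.
arg_reg = {name: f"r{3 + i}" for i, name in enumerate(PARAMS)}

# Prologue moves from the posted disassembly: "mr rD,rS" copies rS into rD.
PROLOGUE = ["mr r27,r3", "mr r25,r7", "mr r26,r8",
            "mr r29,r4", "mr r24,r5", "mr r23,r6"]
saved_in = {}
for insn in PROLOGUE:
    dst, src = insn.split()[1].split(",")
    saved_in[src] = dst

# Register that holds 'acceptable' for the rest of the function.
acceptable_home = saved_in[arg_reg["acceptable"]]

# GPR24..GPR31 line from the second oops in this thread.
GPR24 = "GPR24: c00c05dc 00000000 82002044 ef80cca0 edda2c54 ee10d020 c0437120 c1f79d50"
label, *words = GPR24.split()
first = int(label[3:5])  # "GPR24:" -> 24
gpr = {f"r{first + i}": int(w, 16) for i, w in enumerate(words)}

print(acceptable_home, hex(gpr[acceptable_home]))  # r25 0x0
```

Under those assumptions, the register that `mtctr r25` reads just before the faulting `bctrl` was zero at the time of the second oops, which agrees with CTR: 00000000 in the same register dump.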

Maybe because nfsd_acceptable is 'static' and never explicitly called,
gcc gets confused and optimises it away. Maybe a disassembly of
nfsd_acceptable would be informative ... particularly if it turns out
to be empty.

Could you try removing the 'static' declaration for nfsd_acceptable
and recompile?
Or maybe try a different compiler?

Thanks,
NeilBrown

2008-08-04 05:11:17

by Paul Collins

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

Neil Brown <[email protected]> writes:

> bctrl appears to be the indirect-function-call opcode. There are
> three of them, one each for
> ->fh_to_dentry
> acceptable
> ->fh_to_parent
>
> 0xa8 is 'acceptable'.
>
> In the first traceback, the crash was a call from very early in
> find_acceptable_alias. The first significant thing it does is call
> the 'acceptable' function.
>
> So it seems clear that 'acceptable' is NULL.
> It is equally clear that we never ever set it to NULL in the code.
> The logical conclusion is "compiler error".
> We can confirm (hopefully) by looking at a disassembly of fh_verify.
>
> Maybe because nfsd_acceptable is 'static' and never explicitly called,
> gcc gets confused and optimises it away. Maybe a disassembly of
> nfsd_acceptable would be informative ... particularly if it turns out
> to be empty.

Here's the disassembly.

Dump of assembler code for function nfsd_acceptable:
0xc015f450 <nfsd_acceptable+0>: mflr r0
0xc015f454 <nfsd_acceptable+4>: stw r0,4(r1)
0xc015f458 <nfsd_acceptable+8>: bl 0xc0013154 <_mcount>
0xc015f45c <nfsd_acceptable+12>: stwu r1,-32(r1)
0xc015f460 <nfsd_acceptable+16>: mflr r0
0xc015f464 <nfsd_acceptable+20>: stmw r28,16(r1)
0xc015f468 <nfsd_acceptable+24>: mr r28,r3
0xc015f46c <nfsd_acceptable+28>: mr r31,r1
0xc015f470 <nfsd_acceptable+32>: stw r0,36(r1)
0xc015f474 <nfsd_acceptable+36>: li r30,1
0xc015f478 <nfsd_acceptable+40>: lwz r0,24(r3)
0xc015f47c <nfsd_acceptable+44>: mr r3,r4
0xc015f480 <nfsd_acceptable+48>: andi. r9,r0,1024
0xc015f484 <nfsd_acceptable+52>: bne- 0xc015f56c <nfsd_acceptable+284>
0xc015f488 <nfsd_acceptable+56>: cmpwi cr7,r4,0
0xc015f48c <nfsd_acceptable+60>: beq- cr7,0xc015f4b0 <nfsd_acceptable+96>
0xc015f490 <nfsd_acceptable+64>: lwz r0,0(r4)
0xc015f494 <nfsd_acceptable+68>: cntlzw r0,r0
0xc015f498 <nfsd_acceptable+72>: rlwinm r0,r0,27,5,31
0xc015f49c <nfsd_acceptable+76>: twnei r0,0
0xc015f4a0 <nfsd_acceptable+80>: lwarx r0,0,r4
0xc015f4a4 <nfsd_acceptable+84>: addic r0,r0,1
0xc015f4a8 <nfsd_acceptable+88>: stwcx. r0,0,r4
0xc015f4ac <nfsd_acceptable+92>: bne- 0xc015f4a0 <nfsd_acceptable+80>
0xc015f4b0 <nfsd_acceptable+96>: mr r29,r3
0xc015f4b4 <nfsd_acceptable+100>: b 0xc015f508 <nfsd_acceptable+184>
0xc015f4b8 <nfsd_acceptable+104>: beq- cr6,0xc015f4dc <nfsd_acceptable+140>
0xc015f4bc <nfsd_acceptable+108>: lwz r0,0(r30)
0xc015f4c0 <nfsd_acceptable+112>: cntlzw r0,r0
0xc015f4c4 <nfsd_acceptable+116>: rlwinm r0,r0,27,5,31
0xc015f4c8 <nfsd_acceptable+120>: twnei r0,0
0xc015f4cc <nfsd_acceptable+124>: lwarx r0,0,r30
0xc015f4d0 <nfsd_acceptable+128>: addic r0,r0,1
0xc015f4d4 <nfsd_acceptable+132>: stwcx. r0,0,r30
0xc015f4d8 <nfsd_acceptable+136>: bne- 0xc015f4cc <nfsd_acceptable+124>
0xc015f4dc <nfsd_acceptable+140>: lwz r3,8(r30)
0xc015f4e0 <nfsd_acceptable+144>: li r4,1
0xc015f4e4 <nfsd_acceptable+148>: bl 0xc00b2f50 <inode_permission>
0xc015f4e8 <nfsd_acceptable+152>: cmpwi cr7,r3,0
0xc015f4ec <nfsd_acceptable+156>: mr r3,r29
0xc015f4f0 <nfsd_acceptable+160>: bge+ cr7,0xc015f500 <nfsd_acceptable+176>
0xc015f4f4 <nfsd_acceptable+164>: mr r3,r30
0xc015f4f8 <nfsd_acceptable+168>: bl 0xc00befb0 <dput>
0xc015f4fc <nfsd_acceptable+172>: b 0xc015f524 <nfsd_acceptable+212>
0xc015f500 <nfsd_acceptable+176>: bl 0xc00befb0 <dput>
0xc015f504 <nfsd_acceptable+180>: mr r29,r30
0xc015f508 <nfsd_acceptable+184>: lwz r0,32(r28)
0xc015f50c <nfsd_acceptable+188>: cmpw cr7,r29,r0
0xc015f510 <nfsd_acceptable+192>: beq- cr7,0xc015f524 <nfsd_acceptable+212>
0xc015f514 <nfsd_acceptable+196>: lwz r30,20(r29)
0xc015f518 <nfsd_acceptable+200>: cmpw cr7,r29,r30
0xc015f51c <nfsd_acceptable+204>: cmpwi cr6,r30,0
0xc015f520 <nfsd_acceptable+208>: bne+ cr7,0xc015f4b8 <nfsd_acceptable+104>
0xc015f524 <nfsd_acceptable+212>: lwz r0,32(r28)
0xc015f528 <nfsd_acceptable+216>: cmpw cr7,r29,r0
0xc015f52c <nfsd_acceptable+220>: beq- cr7,0xc015f554 <nfsd_acceptable+260>
0xc015f530 <nfsd_acceptable+224>: lis r9,-16296
0xc015f534 <nfsd_acceptable+228>: lwz r0,17792(r9)
0xc015f538 <nfsd_acceptable+232>: andi. r9,r0,2
0xc015f53c <nfsd_acceptable+236>: beq+ 0xc015f554 <nfsd_acceptable+260>
0xc015f540 <nfsd_acceptable+240>: lis r3,-16309
0xc015f544 <nfsd_acceptable+244>: lwz r5,32(r29)
0xc015f548 <nfsd_acceptable+248>: mr r4,r29
0xc015f54c <nfsd_acceptable+252>: addi r3,r3,7972
0xc015f550 <nfsd_acceptable+256>: bl 0xc00330d4 <printk>
0xc015f554 <nfsd_acceptable+260>: lwz r0,32(r28)
0xc015f558 <nfsd_acceptable+264>: mr r3,r29
0xc015f55c <nfsd_acceptable+268>: xor r30,r29,r0
0xc015f560 <nfsd_acceptable+272>: cntlzw r30,r30
0xc015f564 <nfsd_acceptable+276>: rlwinm r30,r30,27,5,31
0xc015f568 <nfsd_acceptable+280>: bl 0xc00befb0 <dput>
0xc015f56c <nfsd_acceptable+284>: lwz r11,0(r1)
0xc015f570 <nfsd_acceptable+288>: mr r3,r30
0xc015f574 <nfsd_acceptable+292>: lwz r0,4(r11)
0xc015f578 <nfsd_acceptable+296>: lmw r28,-16(r11)
0xc015f57c <nfsd_acceptable+300>: mr r1,r11
0xc015f580 <nfsd_acceptable+304>: mtlr r0
0xc015f584 <nfsd_acceptable+308>: blr
End of assembler dump.
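
As a quick check on the "optimised away" theory, the size of the function can be read off the first and last lines of the dump above. A minimal sketch (the helper name is made up):

```python
import re

# First and last instruction lines of the nfsd_acceptable dump above.
FIRST = "0xc015f450 <nfsd_acceptable+0>: mflr r0"
LAST = "0xc015f584 <nfsd_acceptable+308>: blr"

def offset_of(line):
    """Extract the decimal offset from a '<symbol+offset>' gdb annotation."""
    return int(re.search(r"\+(\d+)>", line).group(1))

INSN_BYTES = 4  # every PowerPC instruction is 4 bytes wide
size = offset_of(LAST) + INSN_BYTES - offset_of(FIRST)
print(size, size // INSN_BYTES)  # 312 78
```

So the compiler emitted a full 78-instruction body for nfsd_acceptable; the function itself is not empty.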

> Could you try removing the 'static' declaration for nfsd_acceptable
> and recompile?
> Or maybe try a different compiler?

I will give these a try this evening.

--
Paul Collins
Wellington, New Zealand

Dag vijandelijk luchtschip de huismeester is dood

2008-08-04 10:01:19

by Paul Collins

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

Paul Collins <[email protected]> writes:

> Neil Brown <[email protected]> writes:
>> Could you try removing the 'static' declaration for nfsd_acceptable
>> and recompile?
>> Or maybe try a different compiler?
>
> I will give these a try this evening.

I built myself a nice new cross compiler:

powerpc-linux-gnu-gcc-4.1 (GCC) 4.1.3 20080623 (prerelease) (Debian 4.1.2-23)

and rebuilt 94ad374a0751f40d25e22e036c37f7263569d24c. Running that on
the server and 2.6.26 on the client, I got yet another Oops. This one
locked the machine up pretty good, so all I have is a picture:

http://ondioline.org/~paul/DSCN1608.JPG

--
Paul Collins
Wellington, New Zealand

Dag vijandelijk luchtschip de huismeester is dood

2008-08-04 14:37:08

by Michael Ellerman

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

On Mon, 2008-08-04 at 22:00 +1200, Paul Collins wrote:
> Paul Collins <[email protected]> writes:
>
> > Neil Brown <[email protected]> writes:
> >> Could you try removing the 'static' declaration for nfsd_acceptable
> >> and recompile?
> >> Or maybe try a different compiler?
> >
> > I will give these a try this evening.
>
> I built myself a nice new cross compiler:
>
> powerpc-linux-gnu-gcc-4.1 (GCC) 4.1.3 20080623 (prerelease) (Debian 4.1.2-23)
>
> and rebuilt 94ad374a0751f40d25e22e036c37f7263569d24c. Running that on
> the server and 2.6.26 on the client, I got yet another Oops. This one
> locked the machine up pretty good, so all I have is a picture:
>
> http://ondioline.org/~paul/DSCN1608.JPG

Wow.

Can you try building a kernel on the server? ie. not over NFS.

cheers

--
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person



2008-08-04 20:51:43

by Paul Collins

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

Michael Ellerman <[email protected]> writes:

> On Mon, 2008-08-04 at 22:00 +1200, Paul Collins wrote:
>> Paul Collins <[email protected]> writes:
>>
>> > Neil Brown <[email protected]> writes:
>> >> Could you try removing the 'static' declaration for nfsd_acceptable
>> >> and recompile?
>> >> Or maybe try a different compiler?
>> >
>> > I will give these a try this evening.
>>
>> I built myself a nice new cross compiler:
>>
>> powerpc-linux-gnu-gcc-4.1 (GCC) 4.1.3 20080623 (prerelease) (Debian 4.1.2-23)
>>
>> and rebuilt 94ad374a0751f40d25e22e036c37f7263569d24c. Running that on
>> the server and 2.6.26 on the client, I got yet another Oops. This one
>> locked the machine up pretty good, so all I have is a picture:
>>
>> http://ondioline.org/~paul/DSCN1608.JPG
>
> Wow.
>
> Can you try building a kernel on the server? ie. not over NFS.

Built kernels on the server with native gcc 4.2.4 and 4.3.1 and repeated
the build test. Both of them threw an Oops with traces like the ones
we've seen before. Also, because now's about time to start shooting in
the dark, I tried a cross-built kernel with CC_OPTIMIZE_FOR_SIZE
disabled. That one Oopses too. Also I reseated the server's memory.

Although I've been able to build kernels over NFS with 2.6.26 on the
server, I've just realized I haven't tried to stress it much. I'll try
a build loop when the server's running 2.6.26 this evening.

--
Paul Collins
Wellington, New Zealand

Dag vijandelijk luchtschip de huismeester is dood

2008-08-04 20:59:27

by J. Bruce Fields

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

On Tue, Aug 05, 2008 at 08:51:23AM +1200, Paul Collins wrote:
> Michael Ellerman <[email protected]> writes:
>
> > On Mon, 2008-08-04 at 22:00 +1200, Paul Collins wrote:
> >> Paul Collins <[email protected]> writes:
> >>
> >> > Neil Brown <[email protected]> writes:
> >> >> Could you try removing the 'static' declaration for nfsd_acceptable
> >> >> and recompile?
> >> >> Or maybe try a different compiler?
> >> >
> >> > I will give these a try this evening.
> >>
> >> I built myself a nice new cross compiler:
> >>
> >> powerpc-linux-gnu-gcc-4.1 (GCC) 4.1.3 20080623 (prerelease) (Debian 4.1.2-23)
> >>
> >> and rebuilt 94ad374a0751f40d25e22e036c37f7263569d24c. Running that on
> >> the server and 2.6.26 on the client, I got yet another Oops. This one
> >> locked the machine up pretty good, so all I have is a picture:
> >>
> >> http://ondioline.org/~paul/DSCN1608.JPG
> >
> > Wow.
> >
> > Can you try building a kernel on the server? ie. not over NFS.
>
> Built kernels on the server with native gcc 4.2.4 and 4.3.1 and repeated
> the build test.

But the build test itself was over nfs? (And you can't reproduce the
same problem without nfs?)

--b.

> Both of them threw an Oops with traces like the ones
> we've seen before. Also, because now's about time to start shooting in
> the dark, I tried a cross-built kernel with CC_OPTIMIZE_FOR_SIZE
> disabled. That one Oopses too. Also I reseated the server's memory.
>
> Although I've been able to build kernels over NFS with 2.6.26 on the
> server, I've just realized I haven't tried to stress it much. I'll try
> a build loop when the server's running 2.6.26 this evening.
>
> --
> Paul Collins
> Wellington, New Zealand
>
> Dag vijandelijk luchtschip de huismeester is dood

2008-08-05 00:17:18

by Michael Ellerman

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

On Mon, 2008-08-04 at 16:59 -0400, J. Bruce Fields wrote:
> On Tue, Aug 05, 2008 at 08:51:23AM +1200, Paul Collins wrote:
> > Michael Ellerman <[email protected]> writes:
> >
> > > On Mon, 2008-08-04 at 22:00 +1200, Paul Collins wrote:
> > >> Paul Collins <[email protected]> writes:
> > >>
> > >> > Neil Brown <[email protected]> writes:
> > >> >> Could you try removing the 'static' declaration for nfsd_acceptable
> > >> >> and recompile?
> > >> >> Or maybe try a different compiler?
> > >> >
> > >> > I will give these a try this evening.
> > >>
> > >> I built myself a nice new cross compiler:
> > >>
> > >> powerpc-linux-gnu-gcc-4.1 (GCC) 4.1.3 20080623 (prerelease) (Debian 4.1.2-23)
> > >>
> > >> and rebuilt 94ad374a0751f40d25e22e036c37f7263569d24c. Running that on
> > >> the server and 2.6.26 on the client, I got yet another Oops. This one
> > >> locked the machine up pretty good, so all I have is a picture:
> > >>
> > >> http://ondioline.org/~paul/DSCN1608.JPG
> > >
> > > Wow.
> > >
> > > Can you try building a kernel on the server? ie. not over NFS.
> >
> > Built kernels on the server with native gcc 4.2.4 and 4.3.1 and repeated
> > the build test.
>
> But the build test itself was over nfs? (And you can't reproduce the
> same problem without nfs?)

Yeah, I'm not clear on that either. What I was aiming at was whether you
can get it to oops somewhere else by not building over NFS, in which case
we can (more or less) rule NFS out.

cheers

--
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person



2008-08-05 03:44:08

by Paul Collins

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

Michael Ellerman <[email protected]> writes:

> On Mon, 2008-08-04 at 16:59 -0400, J. Bruce Fields wrote:
>> On Tue, Aug 05, 2008 at 08:51:23AM +1200, Paul Collins wrote:
>> > Michael Ellerman <[email protected]> writes:
>> >
>> > > On Mon, 2008-08-04 at 22:00 +1200, Paul Collins wrote:
>> > >> Paul Collins <[email protected]> writes:
>> > >>
>> > >> > Neil Brown <[email protected]> writes:
>> > >> >> Could you try removing the 'static' declaration for nfsd_acceptable
>> > >> >> and recompile?
>> > >> >> Or maybe try a different compiler?
>> > >> >
>> > >> > I will give these a try this evening.
>> > >>
>> > >> I built myself a nice new cross compiler:
>> > >>
>> > >> powerpc-linux-gnu-gcc-4.1 (GCC) 4.1.3 20080623 (prerelease) (Debian 4.1.2-23)
>> > >>
>> > >> and rebuilt 94ad374a0751f40d25e22e036c37f7263569d24c. Running that on
>> > >> the server and 2.6.26 on the client, I got yet another Oops. This one
>> > >> locked the machine up pretty good, so all I have is a picture:
>> > >>
>> > >> http://ondioline.org/~paul/DSCN1608.JPG
>> > >
>> > > Wow.
>> > >
>> > > Can you try building a kernel on the server? ie. not over NFS.
>> >
>> > Built kernels on the server with native gcc 4.2.4 and 4.3.1 and repeated
>> > the build test.
>>
>> But the build test itself was over nfs? (And you can't reproduce the
>> same problem without nfs?)
>
> Yeah, I'm not clear on that either. What I was aiming at was can you get
> it to oops somewhere else by not building over NFS - in which case we
> can rule NFS (more or less) out.

I think I may be able to rule NFS out now. I just got this Oops when Xorg
started on boot.

Unable to handle kernel paging request for data at address 0x00000949
Faulting instruction address: 0xc0104190
Oops: Kernel access of bad area, sig: 11 [#1]
PowerMac
Modules linked in: snd_aoa_codec_tas snd_aoa_fabric_layout b43 snd_aoa mac80211 cfg80211 pcmcia snd_aoa_i2sbus snd_pcm_oss snd_pcm snd_page_alloc snd_aoa_soundbus yenta_socket rsrc_nonstatic pcmcia_core ssb uninorth_agp agpgart ehci_hcd ohci_hcd
NIP: c0104190 LR: c0104138 CTR: c01fbd8c
REGS: eee89c40 TRAP: 0300 Not tainted (2.6.27-rc1-00158-g643fbd8)
MSR: 00009032 <EE,ME,IR,DR> CR: 88088222 XER: 20000000
DAR: 00000949, DSISR: 42000000
TASK = c1ebb840[2528] 'Xorg' THREAD: eee88000
GPR00: c0104138 eee89cf0 c1ebb840 00000901 ef507d20 00000007 c0620000 c061ca30
GPR08: ef507d08 c05ae45c 9e370001 eee89cf0 28002248 101f3ca4 101ee800 101ebf1c
GPR16: eee89e2c fffffff4 c05d0000 eee89d60 ffffffd8 eee89d68 101ebf20 ef4ebab4
GPR24: c00d0148 00000000 28088222 ef4eba40 00000901 f0000627 ef3ee5e0 eee89cf0
NIP [c0104190] proc_lookup_de+0xe0/0xf8
LR [c0104138] proc_lookup_de+0x88/0xf8
Call Trace:
[eee89cf0] [c0104138] proc_lookup_de+0x88/0xf8 (unreliable)
[eee89d10] [c010467c] proc_lookup+0x34/0x4c
[eee89d20] [c00c034c] do_lookup+0x1a4/0x220
[eee89d50] [c00c2010] __link_path_walk+0x18c/0xdd4
[eee89dc0] [c00c2cb0] path_walk+0x58/0xe0
[eee89df0] [c00c2e68] do_path_lookup+0x78/0x17c
[eee89e20] [c00c3b58] user_path_at+0x64/0xa4
[eee89e90] [c00baa64] vfs_stat_fd+0x34/0x74
[eee89ec0] [c00bac2c] vfs_stat+0x30/0x48
[eee89ed0] [c00bac74] sys_stat64+0x30/0x5c
[eee89f40] [c0013aa8] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xfc4a300
LR = 0xfc4a2b8
Instruction dump:
4e800020 3860fffe 81610000 800b0004 bb6bffec 7d615b78 7c0803a6 4e800020
3d20c05b 7c641b78 3929e45c 7f83e378 <913c0048> 4bfc98bd 7f83e378 4bfc7dcd
---[ end trace 9be805d8b3000d04 ]---
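
A side note for anyone reading the dump above by hand: the bracketed word
<913c0048> in the instruction dump is the instruction at NIP. The Python
sketch below is not from the original thread, and its function names are
made up for illustration. It decodes that word as a PowerPC D-form store
and recomputes the effective address from the r28 value shown in the GPR24
row. The result matches the DAR line. r28 is one of the non-volatile
registers in the 32-bit PowerPC ABI, and 0x00000901 does not look like a
kernel pointer, so this at least looks like a clobbered register rather
than an NFS-specific failure.

```python
# Sketch: decode a PowerPC D-form load/store word and compute its effective address.
# Only a few primary opcodes are listed; anything else is reported as "unknown".
D_FORM_OPCODES = {32: "lwz", 34: "lbz", 36: "stw", 38: "stb", 40: "lhz", 44: "sth"}


def decode_d_form(word):
    """Split a 32-bit instruction word into (mnemonic, reg, ra, displacement)."""
    opcode = (word >> 26) & 0x3F   # primary opcode, top 6 bits
    reg = (word >> 21) & 0x1F      # source (store) or target (load) register
    ra = (word >> 16) & 0x1F       # base register
    disp = word & 0xFFFF
    if disp & 0x8000:              # sign-extend the 16-bit displacement
        disp -= 0x10000
    return D_FORM_OPCODES.get(opcode, "unknown"), reg, ra, disp


def effective_address(ra, disp, regs):
    """Compute base + displacement, wrapped to 32 bits."""
    base = 0 if ra == 0 else regs[ra]   # RA=0 means a literal zero base, not r0
    return (base + disp) & 0xFFFFFFFF


mnemonic, reg, ra, disp = decode_d_form(0x913C0048)
regs = {28: 0x00000901}                 # r28 as printed in the GPR24 row
print(f"{mnemonic} r{reg},{disp}(r{ra})")
print(hex(effective_address(ra, disp, regs)))
```

The first line printed is the store instruction in assembler syntax, the
second is the address it tried to write, which can be compared with DAR.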


And earlier today I got these three Oopses when I did "du -sh *" in my
homedir:

Oops: Exception in kernel mode, sig: 4 [#1]
PowerMac
Modules linked in: option radeon drm snd_aoa_codec_tas snd_aoa_fabric_layout b43 mac80211 snd_aoa cfg80211 pcmcia snd_aoa_i2sbus snd_pcm_oss snd_pcm snd_page_alloc snd_aoa_soundbus yenta_socket rsrc_nonstatic pcmcia_core ssb uninorth_agp agpgart ehci_hcd ohci_hcd
NIP: c00d01b4 LR: c00d0148 CTR: c01fbd8c
REGS: ec42bbf0 TRAP: 0700 Not tainted (2.6.27-rc1-00158-g643fbd8)
MSR: 00089032 <EE,ME,IR,DR> CR: 24088428 XER: 00000000
TASK = c1c837e0[3610] 'du' THREAD: ec42a000
GPR00: eee15e7c ec42bca0 c1c837e0 00000000 c0f1d4fc 002de7e6 ef306bb0 e9d3d104
GPR08: c0650000 e9d3fe84 e9d3fe7c c05d0000 24088422 10029cb8 10010a8c 10010f5c
GPR16: ec42be3c fffffff4 c05d0000 ec42bd70 ffffffd8 ec42bd78 1000f940 ee0c5338
GPR24: e9d3fe74 c0f0ae20 000126dc c0f1d4fc eee15e00 00000000 002de7e6 ec42bca0
NIP [c00d01b4] iget_locked+0xfc/0x148
LR [c00d0148] iget_locked+0x90/0x148
Call Trace:
[ec42bca0] [c00d0148] iget_locked+0x90/0x148 (unreliable)
[ec42bcd0] [c011cd60] ext3_iget+0x24/0x53c
[ec42bd00] [c0120dbc] ext3_lookup+0x108/0x144
[ec42bd30] [c00c034c] do_lookup+0x1a4/0x220
[ec42bd60] [c00c22ac] __link_path_walk+0x428/0xdd4
[ec42bdd0] [c00c2cb0] path_walk+0x58/0xe0
[ec42be00] [c00c2e68] do_path_lookup+0x78/0x17c
[ec42be30] [c00c3b58] user_path_at+0x64/0xa4
[ec42bea0] [c00ba884] vfs_lstat_fd+0x34/0x74
[ec42bed0] [c00bab2c] sys_fstatat64+0x88/0x90
[ec42bf40] [c0013aa8] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xff4fc1c
LR = 0xff4fbb0
Instruction dump:
93d80020 3d00c065 3d60c05d 39580008 381c007c 812842d0 80ebd00c 39290001
912842d0 39380010 90f80008 914bd00b <00000000> 91470004 916a0004 811c007c
---[ end trace 63d4f9f1d8c7a13d ]---
Oops: Exception in kernel mode, sig: 4 [#2]
PowerMac
Modules linked in: option radeon drm snd_aoa_codec_tas snd_aoa_fabric_layout b43 mac80211 snd_aoa cfg80211 pcmcia snd_aoa_i2sbus snd_pcm_oss snd_pcm snd_page_alloc snd_aoa_soundbus yenta_socket rsrc_nonstatic pcmcia_core ssb uninorth_agp agpgart ehci_hcd ohci_hcd
NIP: c00d01b4 LR: c00d0148 CTR: c01fbd8c
REGS: ee7b7bb0 TRAP: 0700 Tainted: G D (2.6.27-rc1-00158-g643fbd8)
MSR: 00089032 <EE,ME,IR,DR> CR: 28888482 XER: 00000000
TASK = eef1a090[2587] 'emacs' THREAD: ee7b6000
GPR00: ef80e67c ee7b7c60 eef1a090 00000000 c0f13738 f0000000 c0620000 ef616884
GPR08: c0650000 ebdf6b90 ebdf6b88 c05d0000 28888482 102e8e60 102e2180 102e0000
GPR16: ee7b7e5c fffffff4 c05d0000 ee7b7d40 ffffffd8 ee7b7d48 1033a630 ef403e94
GPR24: ebdf6b80 c0f0ae20 00008918 c0f13738 ef80e600 00000000 f0000000 ee7b7c60
NIP [c00d01b4] iget_locked+0xfc/0x148
LR [c00d0148] iget_locked+0x90/0x148
Call Trace:
[ee7b7c60] [c00d0148] iget_locked+0x90/0x148 (unreliable)
[ee7b7c90] [c00fd314] proc_get_inode+0x34/0x188
[ee7b7cb0] [c0104138] proc_lookup_de+0x88/0xf8
[ee7b7cd0] [c010467c] proc_lookup+0x34/0x4c
[ee7b7ce0] [c00fdf10] proc_root_lookup+0x30/0x64
[ee7b7d00] [c00c034c] do_lookup+0x1a4/0x220
[ee7b7d30] [c00c22ac] __link_path_walk+0x428/0xdd4
[ee7b7da0] [c00c2cb0] path_walk+0x58/0xe0
[ee7b7dd0] [c00c2e68] do_path_lookup+0x78/0x17c
[ee7b7e00] [c00c3db4] __path_lookup_intent_open+0x68/0xdc
[ee7b7e30] [c00c3e50] path_lookup_open+0x28/0x40
[ee7b7e40] [c00c40b0] do_filp_open+0xa4/0x7cc
[ee7b7f00] [c00b30d4] do_sys_open+0x6c/0x108
[ee7b7f30] [c00b31e4] sys_open+0x38/0x50
[ee7b7f40] [c0013aa8] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xf1d8560
LR = 0xf1ea534
Instruction dump:
93d80020 3d00c065 3d60c05d 39580008 381c007c 812842d0 80ebd00c 39290001
912842d0 39380010 90f80008 914bd00b <00000000> 91470004 916a0004 811c007c
---[ end trace 63d4f9f1d8c7a13d ]---
Oops: Exception in kernel mode, sig: 4 [#3]
PowerMac
Modules linked in: option radeon drm snd_aoa_codec_tas snd_aoa_fabric_layout b43 mac80211 snd_aoa cfg80211 pcmcia snd_aoa_i2sbus snd_pcm_oss snd_pcm snd_page_alloc snd_aoa_soundbus yenta_socket rsrc_nonstatic pcmcia_core ssb uninorth_agp agpgart ehci_hcd ohci_hcd
NIP: c00d01b4 LR: c00d0148 CTR: c01fbd8c
REGS: ee7b1be0 TRAP: 0700 Tainted: G D (2.6.27-rc1-00158-g643fbd8)
MSR: 00089032 <EE,ME,IR,DR> CR: 22288428 XER: 00000000
TASK = ee7307e0[2574] 'bash' THREAD: ee7b0000
GPR00: eee15e7c ee7b1c90 ee7307e0 00000000 c0f43e10 0037e752 ef306bb0 ef548e9c
GPR08: c0650000 e9d3f12c e9d3f124 c05d0000 22288422 100e5894 100e0000 100df49c
GPR16: ee7b1e2c fffffff4 c05d0000 ee7b1d60 ffffffd8 ee7b1d68 100dde04 ef4f7748
GPR24: e9d3f11c c0f0ae20 00038ff0 c0f43e10 eee15e00 00000000 0037e752 ee7b1c90
NIP [c00d01b4] iget_locked+0xfc/0x148
LR [c00d0148] iget_locked+0x90/0x148
Call Trace:
[ee7b1c90] [c00d0148] iget_locked+0x90/0x148 (unreliable)
[ee7b1cc0] [c011cd60] ext3_iget+0x24/0x53c
[ee7b1cf0] [c0120dbc] ext3_lookup+0x108/0x144
[ee7b1d20] [c00c034c] do_lookup+0x1a4/0x220
[ee7b1d50] [c00c22ac] __link_path_walk+0x428/0xdd4
[ee7b1dc0] [c00c2cb0] path_walk+0x58/0xe0
[ee7b1df0] [c00c2e68] do_path_lookup+0x78/0x17c
[ee7b1e20] [c00c3b58] user_path_at+0x64/0xa4
[ee7b1e90] [c00baa64] vfs_stat_fd+0x34/0x74
[ee7b1ec0] [c00bac2c] vfs_stat+0x30/0x48
[ee7b1ed0] [c00bac74] sys_stat64+0x30/0x5c
[ee7b1f40] [c0013aa8] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xfece5e0
LR = 0x100671fc
Instruction dump:
93d80020 3d00c065 3d60c05d 39580008 381c007c 812842d0 80ebd00c 39290001
912842d0 39380010 90f80008 914bd00b <00000000> 91470004 916a0004 811c007c
---[ end trace 63d4f9f1d8c7a13d ]---

And then all my other windows started disappearing, so I figured it was
time to reboot.


In case anyone wants to disassemble it, I've uploaded the kernel to
http://ondioline.org/~paul/vmlinux-2.6.27-rc1-00158-g643fbd8 and the
config to http://ondioline.org/~paul/config-2.6.27-rc1-00158-g643fbd8

I've rebuilt a whole bunch of times in the course of this little
project, but all four Oopses in this message are from the very
vmlinux linked above.

I have a couple of patches applied locally (a console font and a
Bluetooth HID quirk), so this is really Linus revision
94ad374a0751f40d25e22e036c37f7263569d24c.

--
Paul Collins
Wellington, New Zealand

Dag vijandelijk luchtschip de huismeester is dood

2008-08-05 04:34:34

by Michael Ellerman

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

On Tue, 2008-08-05 at 15:43 +1200, Paul Collins wrote:
> Michael Ellerman <[email protected]> writes:
>
> > On Mon, 2008-08-04 at 16:59 -0400, J. Bruce Fields wrote:
> >> On Tue, Aug 05, 2008 at 08:51:23AM +1200, Paul Collins wrote:
> >> > Michael Ellerman <[email protected]> writes:
> >> >
> >> > > On Mon, 2008-08-04 at 22:00 +1200, Paul Collins wrote:
> >> > >> Paul Collins <[email protected]> writes:
> >> > >>
> >> > >> > Neil Brown <[email protected]> writes:
> >> > >> >> Could you try removing the 'static' declaration for nfsd_acceptable
> >> > >> >> and recompile?
> >> > >> >> Or maybe try a different compiler?
> >> > >> >
> >> > >> > I will give these a try this evening.
> >> > >>
> >> > >> I built myself a nice new cross compiler:
> >> > >>
> >> > >> powerpc-linux-gnu-gcc-4.1 (GCC) 4.1.3 20080623 (prerelease) (Debian 4.1.2-23)
> >> > >>
> >> > >> and rebuilt 94ad374a0751f40d25e22e036c37f7263569d24c. Running that on
> >> > >> the server and 2.6.26 on the client, I got yet another Oops. This one
> >> > >> locked the machine up pretty good, so all I have is a picture:
> >> > >>
> >> > >> http://ondioline.org/~paul/DSCN1608.JPG
> >> > >
> >> > > Wow.
> >> > >
> >> > > Can you try building a kernel on the server? ie. not over NFS.
> >> >
> >> > Built kernels on the server with native gcc 4.2.4 and 4.3.1 and repeated
> >> > the build test.
> >>
> >> But the build test itself was over nfs? (And you can't reproduce the
> >> same problem without nfs?)
> >
> > Yeah, I'm not clear on that either. What I was aiming at was can you get
> > it to oops somewhere else by not building over NFS - in which case we
> > can rule NFS (more or less) out.
>
> I think I may be able to rule NFS out now. I just got this Oops when Xorg
> started on boot.

Cool, that looks fairly convincing.

> In case anyone wants to disassemble it, I've uploaded the kernel to
> http://ondioline.org/~paul/vmlinux-2.6.27-rc1-00158-g643fbd8 and the
> config to http://ondioline.org/~paul/config-2.6.27-rc1-00158-g643fbd8
>
> I've rebuilt a whole bunch of times in the course of this little
> project, but all four Oopses in this message are from the very
> vmlinux linked above.
>
> I have a couple of patches applied locally (a console font and a
> Bluetooth HID quirk), so this is really Linus revision
> 94ad374a0751f40d25e22e036c37f7263569d24c.

And you're _sure_ none of them has a "break-everything" hunk in it? :)


I see you have FTRACE enabled. That's new and could potentially bugger
things up without the compiler knowing, so can you turn that off?

And can you enable CONFIG_CODE_PATCHING_SELFTEST and
CONFIG_FTR_FIXUP_SELFTEST? That will enable tests of some code I changed
that /could/ (maybe) cause random blow ups.
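
For anyone repeating this test, a quick way to confirm that a rebuilt
kernel's .config really has the three options in the requested state is
sketched below in Python. This is an illustration only: the helper names
are hypothetical, the option names are spelled as they appear in this
thread, and the sketch just compares text, so it knows nothing about
Kconfig dependencies.

```python
def parse_kernel_config(text):
    """Return {option: value} for .config text; unset options map to "n"."""
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # "# CONFIG_FOO is not set" is how a disabled option is recorded
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


# Requested state, taken from the message above.
WANTED = {
    "CONFIG_FTRACE": "n",
    "CONFIG_CODE_PATCHING_SELFTEST": "y",
    "CONFIG_FTR_FIXUP_SELFTEST": "y",
}


def mismatches(options, wanted=WANTED):
    """List (name, wanted, actual); an option missing from .config counts as "n"."""
    return [
        (name, want, options.get(name, "n"))
        for name, want in wanted.items()
        if options.get(name, "n") != want
    ]


# Made-up sample: FTRACE still on, one self-test still off.
sample = """\
CONFIG_FTRACE=y
# CONFIG_CODE_PATCHING_SELFTEST is not set
CONFIG_FTR_FIXUP_SELFTEST=y
"""
for name, want, actual in mismatches(parse_kernel_config(sample)):
    print(f"{name}: wanted {want}, found {actual}")
```

Passing the contents of the real .config instead of the sample string
gives an empty list once all three options are set as requested.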

Also, how old is the machine? Any chance you're just seeing random
memory corruption?

cheers

--
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person



2008-08-05 04:47:32

by Paul Collins

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

Michael Ellerman <[email protected]> writes:

> On Tue, 2008-08-05 at 15:43 +1200, Paul Collins wrote:
>> Michael Ellerman <[email protected]> writes:
>> I think I may be able to rule NFS out now. I just got this Oops when Xorg
>> started on boot.
>
> Cool, that looks fairly convincing.
>
>> In case anyone wants to disassemble it, I've uploaded the kernel to
>> http://ondioline.org/~paul/vmlinux-2.6.27-rc1-00158-g643fbd8 and the
>> config to http://ondioline.org/~paul/config-2.6.27-rc1-00158-g643fbd8
>>
>> I've rebuilt a whole bunch of times in the course of this little
>> project, but all four Oopses in this message are from the very
>> vmlinux linked above.
>>
>> I have a couple of patches applied locally (a console font and a
>> Bluetooth HID quirk), so this is really Linus revision
>> 94ad374a0751f40d25e22e036c37f7263569d24c.
>
> And you're _sure_ none of them has a "break-everything" hunk in it? :)

Pretty sure! Here's a diffstat:

$ git diff --stat HEAD^^..
drivers/video/console/Kconfig | 12 +
drivers/video/console/Makefile | 2 +
drivers/video/console/font_neepalt10x20.c | 5392 ++++++++++++++++++++++++
drivers/video/console/font_neepalt12x24.c | 6416 +++++++++++++++++++++++++++++
drivers/video/console/fonts.c | 8 +
include/linux/font.h | 6 +-
net/bluetooth/hidp/core.c | 6 +
7 files changed, 11841 insertions(+), 1 deletions(-)

> I see you have FTRACE enabled. That's new and could potentially bugger
> things up without the compiler knowing, so can you turn that off.
>
> And can you enable CONFIG_CODE_PATCHING_SELFTEST and
> CONFIG_FTR_FIXUP_SELFTEST, that will enable tests of some code I changed
> that /could/ (maybe) cause random blow ups.

I'll try these out.

> Also, how old is the machine? Any chance you're just seeing random
> memory corruption?

It's about four years old. It was in storage for about six months and I
got it repaired a few weeks ago (display cable and inverter). The sort
of crazy crap I've been reporting certainly smacks of memory corruption.
But on the other hand, 2.6.25 (Debian's) and 2.6.26 (my own) have been
trouble-free.

--
Paul Collins
Wellington, New Zealand

Dag vijandelijk luchtschip de huismeester is dood

2008-08-05 07:17:33

by Benjamin Herrenschmidt

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

On Tue, 2008-08-05 at 16:47 +1200, Paul Collins wrote:
> It's about four years old. It was in storage for about six months and I
> got it repaired a few weeks ago (display cable and inverter). The sort
> of crazy crap I've been reporting certainly smacks of memory corruption.
> But on the other hand, 2.6.25 (Debian's) and 2.6.26 (my own) have been
> trouble-free.

Any chance you can bisect the problem ?
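
For readers unfamiliar with the idea: bisecting means binary-searching the
commits between a known-good kernel (here 2.6.26) and a known-bad one,
which is what "git bisect" automates. The Python sketch below only
illustrates the search itself. The revision list and the is_bad callback
are hypothetical stand-ins for "build this revision, boot it, and run the
test". Since the problem reported in this thread normally shows up on the
2nd or 3rd build, a real is_bad would need several clean builds before
calling a revision good.

```python
def first_bad(revisions, is_bad):
    """Return (index of the first bad revision, number of revisions tested).

    Assumes revisions[0] is known good, revisions[-1] is known bad, and that
    every revision after the first bad one is bad as well.
    """
    good, bad = 0, len(revisions) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(revisions[mid]):
            bad = mid      # culprit is at mid or earlier
        else:
            good = mid     # culprit is after mid
    return bad, tests


# Made-up history of 1000 revisions where number 637 introduced the bug.
history = list(range(1000))
index, tests = first_bad(history, lambda rev: rev >= 637)
print(f"first bad revision: {history[index]}, found after {tests} tests")
```

The point of the exercise is the test count: it grows with the logarithm
of the number of revisions, so even a large merge window needs only a
dozen or so build-and-boot cycles.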

Cheers,
Ben.

2008-08-05 09:43:47

by Paul Collins

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

Michael Ellerman <[email protected]> writes:

> I see you have FTRACE enabled. That's new and could potentially bugger
> things up without the compiler knowing, so can you turn that off.

With FTRACE disabled, doing cross-builds from the 2.6.26 amd64 client, a
setup that normally triggers the problem on the 2nd or 3rd build, I was
able to do 10 complete builds. ("make clean oldconfig vmlinux modules")

So it looks like ftrace is the cause, or at least provokes some other
usually-latent problem. I wasn't using it, so I'll just leave it off.
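
A build loop like the one described above can be scripted. The following
Python sketch is an assumption about how one might do it, not the script
that was actually used. The make targets are the ones quoted in this
message; the cross-compile variables from the first message in the thread
(ARCH and CROSS_COMPILE) would have to be added to the command for a
cross-build. The runner parameter exists only so the loop can be exercised
without invoking make.

```python
import subprocess

# Targets as quoted in the message above.
BUILD_COMMAND = ["make", "clean", "oldconfig", "vmlinux", "modules"]


def run_build_loop(iterations, command=BUILD_COMMAND, runner=subprocess.run):
    """Run the build up to `iterations` times; return how many passes completed.

    Stops at the first pass whose exit status is non-zero.
    """
    completed = 0
    for _ in range(iterations):
        result = runner(command)
        if result.returncode != 0:
            break
        completed += 1
    return completed
```

Calling run_build_loop(10) from the top of the kernel tree on the NFS
client would attempt ten complete builds. Note that a server-side oops is
more likely to leave the client processes hung than to produce a failing
exit status, so a hang is itself a result worth recording.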

> And can you enable CONFIG_CODE_PATCHING_SELFTEST and
> CONFIG_FTR_FIXUP_SELFTEST, that will enable tests of some code I changed
> that /could/ (maybe) cause random blow ups.

With those options enabled, I get this:

Running code patching self-tests ...
Running feature fixup self-tests ...

--
Paul Collins
Wellington, New Zealand

Dag vijandelijk luchtschip de huismeester is dood

2008-08-05 11:53:52

by Michael Ellerman

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

On Tue, 2008-08-05 at 21:43 +1200, Paul Collins wrote:
> Michael Ellerman <[email protected]> writes:
>
> > I see you have FTRACE enabled. That's new and could potentially bugger
> > things up without the compiler knowing, so can you turn that off.
>
> With FTRACE disabled, doing cross-builds from the 2.6.26 amd64 client, a
> setup that normally triggers the problem on the 2nd or 3rd build, I was
> able to do 10 complete builds. ("make clean oldconfig vmlinux modules")
>
> So it looks like ftrace is the cause, or at least provokes some other
> usually-latent problem. I wasn't using it, so I'll just leave it off.

OK, that's sort of good, but also not. I can't see anything in the
ftrace code that explains it, but I guess it's lurking. We'll try and
reproduce locally and bang on it.

Thanks for chasing it, and let us know if you get an oops with
CONFIG_FTRACE=n.

> > And can you enable CONFIG_CODE_PATCHING_SELFTEST and
> > CONFIG_FTR_FIXUP_SELFTEST, that will enable tests of some code I changed
> > that /could/ (maybe) cause random blow ups.
>
> With those options enabled, I get this:
>
> Running code patching self-tests ...
> Running feature fixup self-tests ...

That's good, they only print if they fail.

cheers

--
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person



2008-08-06 06:42:52

by Benjamin Herrenschmidt

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

On Tue, 2008-08-05 at 17:16 +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2008-08-05 at 16:47 +1200, Paul Collins wrote:
> > It's about four years old. It was in storage for about six months and I
> > got it repaired a few weeks ago (display cable and inverter). The sort
> > of crazy crap I've been reporting certainly smacks of memory corruption.
> > But on the other hand, 2.6.25 (Debian's) and 2.6.26 (my own) have been
> > trouble-free.
>
> Any chance you can bisect the problem ?

Ok, so I can reproduce on a few 32-bit configs with ftrace enabled.

Looks like some non-volatile GPRs get corrupted. I don't know yet if
ftrace is the culprit though, I couldn't find anything obviously wrong
with the mcount implementation we have.

It looks like the corrupted GPR has been saved/restored on the stack
and that the corruption is due to the stack itself being written
to. It's not clear by whom though and in what circumstances.

We'll have to dig more.

Cheers,
Ben.

2008-08-25 19:59:54

by Bill Davidsen

Subject: Re: nfsd, v4: oops in find_acceptable_alias, ppc32 Linux, post-2.6.27-rc1

Paul Collins wrote:
> Michael Ellerman <[email protected]> writes:

>> Also, how old is the machine? Any chance you're just seeing random
>> memory corruption?
>
> It's about four years old. It was in storage for about six months and I
> got it repaired a few weeks ago (display cable and inverter). The sort
> of crazy crap I've been reporting certainly smacks of memory corruption.
> But on the other hand, 2.6.25 (Debian's) and 2.6.26 (my own) have been
> trouble-free.
>
While it is possible that the new kernel tickles a hardware bug and all
the old ones don't, I would put my $ on the kernel being the issue.

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot